snehasquasher committed
Commit b0d4092 · 1 Parent(s): c9535bf

Upload folder using huggingface_hub
.github/workflows/huggingface.yaml ADDED
@@ -0,0 +1,20 @@
+ name: Sync to Hugging Face hub
+ on:
+   push:
+     branches: [main]
+
+   # to run this workflow manually from the Actions tab
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+         with:
+           fetch-depth: 0
+           lfs: true
+       - name: Push to hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: git push https://snehasquasher:$HF_TOKEN@huggingface.co/spaces/snehasquasher/spur-mvp main
.github/workflows/update_space.yml ADDED
@@ -0,0 +1,28 @@
+ name: Run Python script
+
+ on:
+   push:
+     branches:
+       - main
+
+ jobs:
+   build:
+     runs-on: ubuntu-latest
+
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v2
+
+       - name: Set up Python
+         uses: actions/setup-python@v2
+         with:
+           python-version: '3.9'
+
+       - name: Install Gradio
+         run: python -m pip install gradio
+
+       - name: Log in to Hugging Face
+         run: python -c 'import huggingface_hub; huggingface_hub.login(token="${{ secrets.hf_token }}")'
+
+       - name: Deploy to Spaces
+         run: gradio deploy
.gitignore ADDED
@@ -0,0 +1,11 @@
+ apiKey.py
+ logs
+ logs/*.*
+ __pycache__
+ __pycache__/*
+ *.pyc
+ chromadb/*
+ data
+ notiondb
+ .DS_Store
+ chroma.sqlite3
Constants.py ADDED
@@ -0,0 +1,11 @@
+ DB_TYPE = "notion"  # faiss or chromadb or notion
+ PERSIST_DIRECTORY = "./"
+ CHROMA_PERSIST_DIRECTORY = "./chromadb"
+ NOTION_PERSIST_DIRECTORY = "./notiondb"
+ COLLECTION_NAME = "chatdata"
+ CHROMA_COLLECTION_NAME = "LLMData"
+ NOTION_COLLECTION_NAME = "notionData"
+ DATA_DIRECTORY = "./data"
+ LOG_FILE = "./logs/output.log"
+ NOTION_DB = "0c3bfaa0a33c4038aeeb988c16f83abb"
+ #MAX_PAGES_TO_READ=
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 Harrison Chase
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,37 @@
  ---
- title: Spur Chatbot
- emoji: 🏃
- colorFrom: yellow
- colorTo: green
- sdk: gradio
- sdk_version: 3.41.2
+ title: spur-chatbot
  app_file: app.py
- pinned: false
+ sdk: gradio
+ sdk_version: 3.40.1
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Chat-Your-Data
+
+ Create a ChatGPT-like experience over your custom docs using [LangChain](https://github.com/langchain-ai/langchain).
+
+ See [this blog post](blogpost.md) for a more detailed explanation.
+
+ ## Step 0: Install requirements
+
+ `pip install -r requirements.txt`
+
+ ## Step 1: Set your OpenAI API key
+
+ ```sh
+ export OPENAI_API_KEY="Your OpenAI API key"
+ ```
+
+ ## Step 2: Query data
+
+ Custom prompts are used to ground the answers in the state of the union text file.
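+
+ In this fork the custom prompt actually grounds answers in the uploaded person documents; as an illustrative sketch of what that grounding template looks like (the full version lives in `query_data.py`):
+
+ ```py
+ template = """You are an AI assistant for answering questions about persons.
+ You are given the following extracted parts of a long document and a question. Provide a conversational answer.
+ Question: {question}
+ =========
+ {context}
+ =========
+ Answer in Markdown:"""
+ QA_PROMPT = PromptTemplate(template=template, input_variables=["question", "context"])
+ ```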
+
+ ## Step 3: Running the Application
+
+ Run `python app.py` from the command line to interact with ChatGPT over your own data.
+
+ # Others
+
+ ## Notion Integration
 
+ ## Step 1: Set your Notion API Key
+ ```sh
+ export NOTION_API_KEY="Your Notion API Key"
+ ```
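+
+ ## Step 2: Select the Notion backend
+
+ A minimal sketch based on the defaults in `Constants.py`: keep `DB_TYPE` set to `"notion"` (this commit's default), then re-run ingestion so `ingest_data.py` pulls pages through `NotionDBLoader` into the local Chroma store.
+
+ ```py
+ # Constants.py -- "notion" routes ingest_data.py to NotionDBLoader
+ DB_TYPE = "notion"
+ ```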
__pycache__/Constants.cpython-310.pyc ADDED
Binary file (389 Bytes)
__pycache__/apiKey.cpython-310.pyc ADDED
Binary file (226 Bytes)
__pycache__/db_types.cpython-310.pyc ADDED
Binary file (417 Bytes)
__pycache__/ingest_data.cpython-310.pyc ADDED
Binary file (3.76 kB)
__pycache__/metadatainfo.cpython-310.pyc ADDED
Binary file (553 Bytes)
__pycache__/notionMetadataInfo.cpython-310.pyc ADDED
Binary file (450 Bytes)
__pycache__/query_data.cpython-310.pyc ADDED
Binary file (6.79 kB)
__pycache__/read_notion.cpython-310.pyc ADDED
Binary file (1.03 kB)
__pycache__/utilities.cpython-310.pyc ADDED
Binary file (921 Bytes)
apiKey.py ADDED
@@ -0,0 +1,2 @@
+ OPENAI_API_KEY="sk-<redacted>"
+ NOTION_API_KEY="secret_<redacted>"
app.py ADDED
@@ -0,0 +1,215 @@
+ import os
+ import sys
+ from typing import Optional, Tuple
+ from threading import Lock
+ import json
+ import shutil
+ import gradio as gr
+ from query_data import chain_options
+ from query_data import get_basic_qa_chain
+ from zipfile import ZipFile
+ from ingest_data import ingestData
+
+ from query_data import (get_basic_qa_chain,
+                         get_qa_with_sources_chain,
+                         get_custom_prompt_qa_chain,
+                         get_condense_prompt_qa_chain,
+                         get_retrievalqa_with_sources_chain)
+
+ from metadatainfo import metadata_field_info
+ from Constants import *
+ from apiKey import *
+
+ def set_openai_api_key(api_key: str):
+     """Set the api key and return chain.
+     If no api_key, then None is returned.
+     """
+     if api_key:
+         os.environ["OPENAI_API_KEY"] = api_key
+         chain = getChainSelectedByUser(chainType)
+         os.environ["OPENAI_API_KEY"] = ""
+         return chain
+     '''
+     os.environ["OPENAI_API_KEY"] = api_key
+     chain = get_basic_qa_chain()
+     return chain'''
+
+ def getChainSelectedByUser(chainType: gr.Dropdown):
+     chain = get_basic_qa_chain()
+
+     if chainType == "with_sources":
+         chain = get_qa_with_sources_chain()
+     elif chainType == "custom_prompt":
+         chain = get_custom_prompt_qa_chain()
+     elif chainType == "condense_prompt":
+         chain = get_condense_prompt_qa_chain()
+     elif chainType == "retrieval_sources_chain":
+         chain = get_retrievalqa_with_sources_chain()
+
+     return chain
+
+ class Logger:
+     def __init__(self, filename):
+         self.terminal = sys.stdout
+         self.log = open(filename, "w")
+
+     def write(self, message):
+         self.terminal.write(message)
+         self.log.write(message)
+
+     def flush(self):
+         self.terminal.flush()
+         self.log.flush()
+
+     def isatty(self):
+         return False
+
+ sys.stdout = Logger(LOG_FILE)
+
+ def read_logs():
+     sys.stdout.flush()
+     with open(LOG_FILE, "r") as f:
+         return f.read()
+
+ def upload_file(files):
+     file_paths = [file.name for file in files]
+     for f in file_paths:
+         print("moving file :" + f)
+         shutil.copy(f, DATA_DIRECTORY)
+     return file_paths
+
+ def ingest():
+     ingestData()
+
+ class ChatWrapper:
+
+     def __init__(self):
+         self.lock = Lock()
+
+     def __call__(
+         self, api_key: str, inp: str, history: Optional[Tuple[str, str]], chain, chainType
+     ):
+         """Execute the chat functionality."""
+         self.lock.acquire()
+         try:
+             history = history or []
+             # If chain is None, that is because no API key was provided.
+             if chain is None:
+                 '''os.environ["OPENAI_API_KEY"] = api_key
+                 chain = get_basic_qa_chain()'''
+                 history.append((inp, "Please paste your OpenAI key to use"))
+                 return history, history
+             # Set OpenAI key
+             import openai
+
+             openai.api_key = api_key
+             print("calling chain of type " + str(type(chain)))
+             # Run chain and append input.
+             results = chain({"question": inp})
+             # metadata=metadata_field_info,
+             # include_run_info=True)
+             print("result keys :")
+             print(*results, sep=" ")
+
+             output = results["answer"]
+
+             if chainType == "with_sources":
+                 print("document source count :" + str(len(results["source_documents"])))
+                 for s in results["source_documents"]:
+                     for key in s.metadata:
+                         output = output + "<br>" + key + ":" + s.metadata[key] + "<br>"
+
+             elif chainType == "retrieval_sources_chain":
+                 print("results")
+                 # output = output + "<br>" + "SOURCE:" + results["sources"]
+             history.append((inp, output))
+         except Exception as e:
+             raise e
+         finally:
+             self.lock.release()
+         return history, history
+
+ chat = ChatWrapper()
+
+ block = gr.Blocks(gr.themes.Soft(),
+                   analytics_enabled=True)
+
+ with block:
+     with gr.Row():
+         # api_key=OPENAI_API_KEY
+         gr.Markdown(
+             "<h3><center>Chat-Your-Data</center></h3>")
+
+         openai_api_key_textbox = gr.Textbox(
+             # value=api_key,
+             placeholder="Paste your OpenAI API key (sk-...)",
+             show_label=False,
+             lines=1,
+             type="password",
+         )
+     # set_openai_api_key(api_key)
+     chatbot = gr.Chatbot()
+
+
+     with gr.Row():
+         message = gr.Textbox(
+             value="ask me something about your data",
+             label="What's your question?",
+             placeholder="Ask questions about the most recent state of the union",
+             lines=1,
+         )
+         submit = gr.Button(value="Send", variant="secondary").style(
+             scale=1)
+
+     gr.Examples(
+         examples=[
+             "Who is Tanmay Chopra?",
+             "Which persons know about the topics LLM?",
+             "What did Navid say about LLM?",
+         ],
+         inputs=message,
+     )
+
+     with gr.Row():
+         chainType = gr.Dropdown(list(chain_options.keys()),
+                                 label="Chain Type", value="basic"
+                                 )
+
+     with gr.Accordion(label="show_logs"):
+         logs = gr.Textbox(label="Console")
+         block.load(read_logs, None, logs, every=1)
+
+     file_output = gr.File()
+     upload_button = gr.UploadButton("Click to Upload a File", file_types=[".docx", ".pdf", ".txt", ".json"], file_count="multiple")
+     files = upload_button.upload(upload_file, upload_button, file_output)
+     # gr.Gallery(files)
+     btn = gr.Button(value="Ingest")
+     btn.click(ingest)
+     gr.HTML("Demo application of a LangChain chain.")
+
+     gr.HTML(
+         "<center>Powered by <a href='https://github.com/hwchase17/langchain'>LangChain 🦜️🔗</a></center>"
+     )
+
+     state = gr.State()
+     agent_state = gr.State()
+
+     submit.click(chat, inputs=[openai_api_key_textbox, message,
+                                state, agent_state, chainType], outputs=[chatbot, state])
+     message.submit(chat, inputs=[
+         openai_api_key_textbox, message, state, agent_state, chainType], outputs=[chatbot, state])
+
+     openai_api_key_textbox.change(
+         set_openai_api_key,
+         inputs=[openai_api_key_textbox],
+         outputs=[agent_state],
+     )
+
+     chainType.change(
+         getChainSelectedByUser,
+         inputs=[chainType],
+         outputs=[agent_state],
+     )
+
+ block.queue().launch(debug=True)
assets/logo/logo.jpg ADDED
blogpost.md ADDED
@@ -0,0 +1,330 @@
+ **_Note: See the accompanying GitHub repo for this blogpost [here](https://github.com/hwchase17/chat-your-data)._**
+ **Note: Last updated by [Bill Chambers](http://billchambers.me/). August, 2023.**
+
+ ChatGPT has taken the world by storm. But while it's great for general-purpose knowledge, it only knows what it was trained on, which is generally available internet data from before 2021. It doesn't know about your private data, nor about recent sources of data.
+
+ Wouldn't it be useful if it did?
+
+ This blog post is a tutorial on how to set up your own version of ChatGPT over a specific corpus of data. There is an [accompanying GitHub repo](https://github.com/hwchase17/chat-your-data) that has the relevant code referenced in this post. Specifically, this deals with text data. For how to interact with other sources of data with a natural language layer, see the below tutorials:
+
+ * [SQL Database](https://python.langchain.com/docs/modules/chains/popular/sqlite)
+ * [APIs](https://python.langchain.com/docs/modules/chains/popular/api)
+
+ ## High Level Overview
+
+ At a high level, there are two components to setting up ChatGPT over your own data: (1) ingestion of the data, (2) a chatbot over the data. Let's talk a bit about the steps involved in each.
+
+ ### Ingestion of data
+
+ ![Diagram of ingestion process](https://blog.langchain.dev/content/images/2023/02/ingest.png)
+
+ Ingestion involves several steps. The steps are:
+
+ 1. **Load data sources to text**: this involves loading your data from arbitrary sources to text in a form that can be used downstream. This is one place where we hope the community will help out!
+ 2. **Chunk text**: this involves chunking the loaded text into smaller chunks. This is necessary because language models generally have a limit to the amount of text (tokens) they can deal with. "Chunk size" is something to be tuned over time.
+ 3. **Embed text**: this involves creating a numerical embedding for each chunk of text. This is necessary because we only want to select the most relevant chunks of text for a given question, and we will do this by finding the most similar chunks in the embedding space.
+ 4. **Load embeddings to vectorstore**: this involves putting embeddings and documents into a vectorstore. Vectorstores help us find the most similar chunks in the embedding space quickly and efficiently.
+
+ LangChain strives to be modular, so that each of these steps is straightforward to swap out with other components or approaches.
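+
+ As a concrete, illustrative example of that modularity, the `ingest_data.py` in this repo swaps the blog post's `CharacterTextSplitter` for a `RecursiveCharacterTextSplitter` without touching the rest of the pipeline:
+
+ ```py
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+ # one-line swap; loading, embedding, and storage stay exactly the same
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
+ documents = text_splitter.split_documents(raw_documents)
+ ```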
+
+ ### Querying of Data
+
+ ![Diagram of query process](https://blog.langchain.dev/content/images/2023/02/query.png)
+
+ This can also be broken down into a few steps. The high level steps are:
+
+ 1. **Get input from the user**: we'll use a web interface and a CLI interface to receive input from the user about the documents.
+ 2. **Combine that input with chat history**: we'll combine chat history and a new question into a single standalone question. This is often necessary because we want to allow for the ability to ask follow up questions (an important UX consideration).
+ 3. **Look up relevant documents**: using the vectorstore created during ingestion, we will look up relevant documents for the answer.
+ 4. **Generate a response**: given the standalone question and the relevant documents, we will use a language model to generate a response.
+
+ In this post, we'll explore some design decisions you have with history, prompts, and the chat experience. We won't touch on deployment, but for more information see [our deployment guide](https://python.langchain.com/docs/guides/deployments/).
+
+ ## Step by Step Details
+
+ This section dives into more detail on the steps necessary to ingest data.
+
+ ![Diagram of ingestion process](https://blog.langchain.dev/content/images/2023/02/ingest-1.png)
+
+ ### Load data
+
+ First, we need to load data into a standard format. In LangChain, a [`Document`](https://docs.langchain.com/docs/components/schema/document) consists of (1) the text itself, and (2) any metadata associated with that text (where it came from, etc.). This is often critical for understanding and communicating the context for testing or for the end user.
+
+ The community has contributed dozens of document loaders and we look forward to seeing more and more join the community. [See our documentation (and over 120 data loaders) for more information about document loaders](https://python.langchain.com/docs/integrations/document_loaders/). Please open a pull request or file an issue if you'd like to contribute (or request) a new document loader.
+
+ The snippet below contains the line of code responsible for loading the relevant documents.
+
+ ```py
+ print("Loading data...")
+ loader = UnstructuredFileLoader("state_of_the_union.txt")
+ raw_documents = loader.load()
+ ```
+
+ ### Split Text
+
+ Splitting documents into smaller units of text for input into the model is critical for getting relevant information back from our chatbot. When documents are too big, you'll feed irrelevant information to the model. Conversely, when they're too small, you won't include enough information and the model may be confused about what is actually relevant.
+
+ The chunk size isn't quite a science, so you'll have to experiment to see what gives good results.
+
+ ```py
+ print("Splitting text...")
+ text_splitter = CharacterTextSplitter(
+     separator="\n\n",
+     chunk_size=600,
+     chunk_overlap=100,
+     length_function=len,
+ )
+ documents = text_splitter.split_documents(raw_documents)
+ ```
+
+ ### Create embeddings and store in vectorstore
+
+ Now that we have small chunks of text, we need to create an embedding for each piece of text and store them all in a vectorstore. Embeddings are an efficient way of storing this text data so we can later query the store for the documents most relevant to a question.
+
+ Here we use OpenAI's embeddings and a [FAISS vectorstore](https://faiss.ai/index.html), and store the result as a Python pickle file for later use.
+
+ ```py
+ print("Creating vectorstore...")
+ embeddings = OpenAIEmbeddings()
+ vectorstore = FAISS.from_documents(documents, embeddings)
+ with open("vectorstore.pkl", "wb") as f:
+     pickle.dump(vectorstore, f)
+ ```
+
+ Run `python ingest_data.py` to create the vectorstore. This is necessary after changing how you split the text or loading new documents: if you're making changes, adding documents, or splitting text differently, you'll have to re-run it.
+
+ ## Query data
+
+ Now that we've ingested the data, we can use it in a chatbot interface. To do this, we will use the [ConversationalRetrievalChain](https://python.langchain.com/docs/use_cases/question_answering/how_to/chat_vector_db).
+
+ ![Diagram of ConversationalRetrievalChain](https://blog.langchain.dev/content/images/2023/02/query-1.png)
+
+ There are several different options when it comes to querying the data. Do you want to allow follow up questions? Do you want to include other user context? There are lots of design decisions, and below we'll discuss some of the most critical.
+
+ ### Do you want to have conversation history?
+
+ This is table stakes from a UX perspective because it allows for follow up questions. Adding memory is simple: you can use a built-in module.
+
+ ```py
+ llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+ retriever = load_retriever()
+ memory = ConversationBufferMemory(
+     memory_key="chat_history", return_messages=True)
+ # model = RetrievalQA.from_llm(llm=llm, retriever=retriever)
+ # if you don't want memory, use the above; you will have to change
+ # the app.py or cli_app.py file to include `query` in the input instead of `question`
+ model = ConversationalRetrievalChain.from_llm(
+     llm=llm,
+     retriever=retriever,
+     memory=memory)
+ ```
+
+ Alternatively, you can specify memory and pass it into the model, tracking it on your own. Run this example from the GitHub repo with the following, then read the code in `query_data.py`.
+
+ ```sh
+ python cli_app.py
+
+ Which QA model would you like to work with? [basic/with_sources/custom_prompt/condense_prompt] (basic):
+ Chat with your docs!
+ ---------------
+ Your Question: (what did the president say about ketanji brown?):
+ Answer: The President nominated Ketanji Brown Jackson to serve on the United States Supreme Court, describing her as one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence. He also mentioned that she
+ is a former top litigator in private practice, a former federal public defender, and comes from a family of public school educators and police officers. He referred to her as a consensus builder and noted that since her nomination, she
+ has received a broad range of support from various groups, including the Fraternal Order of Police and former judges appointed by both Democrats and Republicans.
+ ---------------
+ ```
+
+ ### Do you want to customize the QA prompt?
+
+ You can easily customize the QA prompt by passing in a prompt of your choice. This is similar in experience to most chains in LangChain. [Learn more about custom prompts here.](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa#return-source-documents)
+
+ ```py
+ template = """You are an AI assistant for answering questions about the most recent state of the union address.
+ You are given the following extracted parts of a long document and a question. Provide a conversational answer.
+ If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
+ If the question is not about the most recent state of the union, politely inform them that you are tuned to only answer questions about the most recent state of the union.
+ Lastly, answer the question as if you were a pirate from the south seas and are just coming back from a pirate expedition where you found a treasure chest full of gold doubloons.
+ Question: {question}
+ =========
+ {context}
+ =========
+ Answer in Markdown:"""
+
+ QA_PROMPT = PromptTemplate(template=template, input_variables=[
+     "question", "context"])
+ llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+ retriever = load_retriever()
+ memory = ConversationBufferMemory(
+     memory_key="chat_history", return_messages=True)
+ model = ConversationalRetrievalChain.from_llm(
+     llm=llm,
+     retriever=retriever,
+     memory=memory,
+     combine_docs_chain_kwargs={"prompt": QA_PROMPT})
+ ```
+
+ Run this example from the GitHub repo with the following, then read the code in `query_data.py`.
+
+ ```sh
+ python cli_app.py
+ Which QA model would you like to work with? [basic/with_sources/custom_prompt/condense_prompt] (basic): custom_prompt
+ Chat with your docs!
+ ---------------
+ Your Question: (what did the president say about ketanji brown?):
+ Answer: Arr matey, the cap'n, I mean the President, he did speak of Ketanji Brown Jackson, he did. He nominated her to the United States Supreme Court, he did, just 4 days before his address. He spoke highly of her, he did, callin' her
+ one of the nation's top legal minds. He believes she'll continue Justice Breyer's legacy of excellence, he does.
+
+ She's been a top litigator in private practice, a federal public defender, and comes from a family of public school educators and police officers. She's a consensus builder, she is. Since her nomination, she's received support from all
+ over, from the Fraternal Order of Police to former judges appointed by both Democrats and Republicans. So, that's what the President had to say about Ketanji Brown Jackson, it is.
+ ---------------
+ Your Question: (what did the president say about ketanji brown?): who did she succeed?
+ Answer: Arr matey, ye be askin' about who Judge Ketanji Brown Jackson be succeedin'. From the words of the President himself, she be takin' over from Justice Breyer, continuin' his legacy of excellence on the United States Supreme
+ Court. Now, let's get back to countin' me gold doubloons, aye?
+ ---------------
+ ```
+
+ ### Do you expect long conversations?
+
+ If so, you're going to want to condense previous questions and history in order to add context into the prompt. If you embed the whole chat history along with the new question to look up relevant documents, you may pull in documents no longer relevant to the conversation (if the new question is not related at all). Therefore, this step of condensing the chat history and a new question into a standalone question is very important.
+
+ ```py
+ _template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.
+ You can assume the question about the most recent state of the union address.
+
+ Chat History:
+ {chat_history}
+ Follow Up Input: {question}
+ Standalone question:"""
+ CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
+
+
+ llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+ retriever = load_retriever()
+ memory = ConversationBufferMemory(
+     memory_key="chat_history", return_messages=True)
+ # see: https://github.com/langchain-ai/langchain/issues/5890
+ model = ConversationalRetrievalChain.from_llm(
+     llm=llm,
+     retriever=retriever,
+     memory=memory,
+     condense_question_prompt=CONDENSE_QUESTION_PROMPT,
+     combine_docs_chain_kwargs={"prompt": QA_PROMPT})  # includes the custom prompt as well
+ ```
+
+ Read the code in `query_data.py` for some example code to apply to your own projects.
+
+ ### Do you want the model to cite sources?
+
+ [LangChain can return the source documents behind an answer.](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa#return-source-documents) There's a lot you can do here: you can add your own metadata, your own sections, and other relevant information to return the most relevant metadata for your query.
+
+ ```py
+ llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+ retriever = load_retriever()
+ history = []
+ model = ConversationalRetrievalChain.from_llm(
+     llm=llm,
+     retriever=retriever,
+     return_source_documents=True)
+
+ def model_func(question):
+     # bug: this doesn't work with the built-in memory
+     # see: https://github.com/langchain-ai/langchain/issues/5630
+     new_input = {"question": question['question'], "chat_history": history}
+     result = model(new_input)
+     history.append((question['question'], result['answer']))
+     return result
+
+ model_func({"question": "some question you have"})
+ # this is the same interface as all the other models.
+ ```
+
+ Run this example from the GitHub repo with the following, then read the code in `query_data.py`.
+
+ ```sh
+ python cli_app.py
+ Which QA model would you like to work with? [basic/with_sources/custom_prompt/condense_prompt] (basic): with_sources
+ Chat with your docs!
+ ---------------
+ Your Question: (what did the president say about ketanji brown?):
+ Answer: The President nominated Ketanji Brown Jackson to serve on the United States Supreme Court, describing her as one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence. He also mentioned that she
+ is a former top litigator in private practice, a former federal public defender, and comes from a family of public school educators and police officers. Since her nomination, she has received a broad range of support, including from the
+ Fraternal Order of Police and former judges appointed by both Democrats and Republicans.
+ Sources:
+ state_of_the_union.txt
+ One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
+
+ And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
+ state_of_the_union.txt
+ As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.
+
+ While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military
+ justice.
+ state_of_the_union.txt
+ But in my administration, the watchdogs have been welcomed back.
+
+ We’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.
+
+ And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.
+
+ By the end of this year, the deficit will be down to less than half what it was before I took office.
+
+ The only president ever to cut the deficit by more than one trillion dollars in a single year.
+
+ Lowering your costs also means demanding more competition.
+ state_of_the_union.txt
+ A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of
+ support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.
+
+ And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.
+
+ We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.
+ ---------------
+ Your Question: (what did the president say about ketanji brown?): where did she work before?
+ Answer: Before her nomination to the United States Supreme Court, Ketanji Brown Jackson worked as a Circuit Court of Appeals Judge. She was also a former top litigator in private practice and a former federal public defender.
+ Sources:
+ state_of_the_union.txt
+ One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
+
+ And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
+ state_of_the_union.txt
+ A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of
+ support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.
+
+ And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.
+
+ We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.
+ state_of_the_union.txt
+ We cannot let this happen.
+
+ Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
+
+ Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for
+ your service.
+ state_of_the_union.txt
+ Vice President Harris and I ran for office with a new economic vision for America.
+
+ Invest in America. Educate Americans. Grow the workforce. Build the economy from the bottom up and the middle out, not from the top down.
+
+ Because we know that when the middle class grows, the poor have a ladder up and the wealthy do very well.
+
+ America used to have the best roads, bridges, and airports on Earth.
+
+ Now our infrastructure is ranked 13th in the world.
+
+ We won’t be able to compete for the jobs of the 21st Century if we don’t fix that.
+ ---------------
+ ```
+
+ ### Language Model
+
+ The final lever to pull is which language model you use to power your chatbot. In our example we use an OpenAI LLM, but it can easily be substituted with other language models that LangChain supports, or you can even write your own wrapper.
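+
+ A minimal sketch of such a substitution, assuming any chat model LangChain supports can stand in for the `ChatOpenAI` used throughout this post (here only the model name changes):
+
+ ```py
+ llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)  # was "gpt-4" above
+ model = ConversationalRetrievalChain.from_llm(
+     llm=llm,
+     retriever=load_retriever(),
+     memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True))
+ ```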
+
+ ## Putting it all together
+
+ After making all the necessary customizations and running `python ingest_data.py`, you can now interact with the chatbot.
+
+ We've exposed a really simple interface for doing so: just run `python cli_app.py` to ask questions and get back answers right in the terminal. Try it out!
+
+ We also have an example of deploying this app via Gradio! You can do so by running `python app.py`. This can also easily be deployed to Hugging Face Spaces - see [example space here](https://huggingface.co/spaces/hwchase17/chat-your-data-state-of-the-union).
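+
+ For reference, the deploy itself is a single command from the project directory (this repo also automates it in `.github/workflows/update_space.yml`):
+
+ ```sh
+ gradio deploy
+ ```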
+
+ ![langchain hugging face spaces](https://blog.langchain.dev/content/images/2023/02/Screen-Shot-2023-02-07-at-9.01.42-AM.png)
chromaclient.py ADDED
@@ -0,0 +1,17 @@
+ import chromadb
+ import openai
+ from Constants import *
+ from langchain.embeddings import OpenAIEmbeddings
+ import os
+ from langchain.vectorstores import Chroma
+ openai.api_key = OPENAI_API_KEY
+ os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
+ # chroma_client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIRECTORY, embeddings=OpenAIEmbeddings())
+
+ vectorstore = Chroma(persist_directory=CHROMA_PERSIST_DIRECTORY, collection_name=CHROMA_COLLECTION_NAME, embedding_function=OpenAIEmbeddings())
+ print("Chroma collection count : " + str(vectorstore._collection.count()))
+ # collection = chroma_client.get_collection(name="myname")
+ # results = collection.query(query_texts=["Who is Tanmay"], n_results=10)
+ # print(collection.get(include=["embeddings", "documents", "metadatas"]))
+ # collection.get(include=["embeddings", "documents", "metadatas"])
+
chromadb/cd6d665c-a1d1-4b9b-b8a5-cfa0f731d4d0/length.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fcefd9550ec0a4b4cf0550e48b6589a991eaf7590c089fdaafa4298dab5c6f90
+ size 4000
cli_app.py ADDED
@@ -0,0 +1,34 @@
+ import os
+ from query_data import chain_options
+ from rich.console import Console
+ from rich.prompt import Prompt
+ from Constants import *
+ from apiKey import *
+
+ os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
+ if __name__ == "__main__":
+     c = Console()
+     model = Prompt.ask("Which QA model would you like to work with?",
+                        choices=list(chain_options.keys()),
+                        default="basic")
+     chain = chain_options[model]()
+
+     c.print("[bold]Chat with your docs!")
+     c.print("[bold red]---------------")
+
+     while True:
+         default_question = "what did the president say about ketanji brown?"
+         question = Prompt.ask("Your Question: ", default=default_question)
+         # change this line if you're using RetrievalQA
+         # input = query
+         # output = result
+         result = chain({"question": question})
+         c.print("[green]Answer: [/green]" + result['answer'])
+
+         # include a bit more if we're using `with_sources`
+         if model == "with_sources" and result.get('source_documents', None):
+             c.print("[green]Sources: [/green]")
+             for doc in result['source_documents']:
+                 c.print(f"[bold underline green]{doc.metadata['source']}")
+                 c.print("[green]" + doc.page_content)
+         c.print("[bold red]---------------")
data/Evan Cover.docx ADDED
Binary file (7.56 kB)
data/Josua Krause.docx ADDED
Binary file (8.33 kB)
data/Navid.docx ADDED
Binary file (13.6 kB)
data/Neal Patel.docx ADDED
Binary file (7.36 kB)
data/Siva_values.docx ADDED
Binary file (15.7 kB)
data/Tanmay Chopra.docx ADDED
Binary file (13.1 kB)
data_back/.DS_Store ADDED
Binary file (6.15 kB)
data_back/Evan Cover.docx ADDED
Binary file (7.56 kB)
data_back/Josua Krause.docx ADDED
Binary file (8.33 kB)
data_back/Navid.docx ADDED
Binary file (13.6 kB)
data_back/Neal Patel.docx ADDED
Binary file (7.36 kB)
data_back/Siva_values.docx ADDED
Binary file (15.7 kB)
data_back/Tanmay Chopra.docx ADDED
Binary file (13.1 kB)
db_types.py ADDED
@@ -0,0 +1,6 @@
+ from enum import Enum
+
+ class DBTypes(Enum):
+     CHROMA = "chromadb"
+     FAISS = "faiss"
+     NOTION = "notion"
ingest_data.py ADDED
@@ -0,0 +1,242 @@
+ import os
+ import openai
+ from langchain.text_splitter import CharacterTextSplitter
+ from langchain.document_loaders import UnstructuredFileLoader
+ from langchain.vectorstores.faiss import FAISS
+ from langchain.embeddings import OpenAIEmbeddings
+ from langchain.document_loaders import DirectoryLoader
+ from langchain.document_loaders import TextLoader
+ from langchain.document_loaders import CSVLoader
+ from langchain.document_loaders import PyPDFLoader
+ from langchain.document_loaders import UnstructuredWordDocumentLoader
+ from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
+ from langchain.vectorstores import Chroma
+ from langchain.document_loaders import NotionDBLoader
+ from langchain.vectorstores.utils import filter_complex_metadata
+ import pickle
+ from Constants import *
+ from apiKey import *
+ from db_types import *
+ from utilities import transform_complex_metadata
+
+ def createChromaFromNotiondb(documents, embeddings):
+     vectordb = Chroma(persist_directory=NOTION_PERSIST_DIRECTORY, embedding_function=embeddings,
+                       collection_name=NOTION_COLLECTION_NAME)
+     print("Checking for existing collection count " + str(vectordb._collection.count()))
+     if vectordb._collection.count() == 0:
+         print("Transforming notion collection " + NOTION_COLLECTION_NAME)
+         documents = transform_complex_metadata(documents)
+         print("Creating notion database")
+         vectordb = Chroma.from_documents(documents=documents, embedding=embeddings, persist_directory=NOTION_PERSIST_DIRECTORY, collection_name=NOTION_COLLECTION_NAME)
+         vectordb.persist()
+         print("Count of Notion collections: " + str(vectordb._collection.count()))
+     else:
+         print("Count of Notion collections: " + str(vectordb._collection.count()))
+
+ def createChromadb(documents, embeddings):
+     vectordb = Chroma(persist_directory=CHROMA_PERSIST_DIRECTORY, embedding_function=embeddings,
+                       collection_name=CHROMA_COLLECTION_NAME)
+     if vectordb._collection.count() == 0:
+         print("Creating chromadb")
+         vectordb = Chroma.from_documents(documents=documents, embedding=embeddings, persist_directory=CHROMA_PERSIST_DIRECTORY, collection_name=CHROMA_COLLECTION_NAME)
+         vectordb.persist()
+         print("Count of collections: " + str(vectordb._collection.count()))
+     else:
+         print("Count of collections: " + str(vectordb._collection.count()))
+
+ def createFaissVectorstore(documents, embeddings):
+     print("Creating vectorstore...")
+     vectorstore = FAISS.from_documents(documents, embeddings)
+     with open("myvectorstore.pkl", "wb") as f:
+         pickle.dump(vectorstore, f)
+
+ def enrichMetadata(docs):
+
+     for doc in docs:
+         for m in custom_meta_data:
+             if doc.metadata["source"] != "":
+                 if m.get("name") in doc.metadata["source"]:
+                     doc.metadata["name"] = m.get("name")
+                     doc.metadata["profile"] = m.get("profile")
+                     doc.metadata["creationYear"] = m.get("creationYear")
+                     doc.metadata["topics"] = m.get("topics")
+
+ class MyLoader:
+     def __init__(self, file_path, **kwargs):
+         if file_path.endswith('.docx'):
+             self.loader = UnstructuredWordDocumentLoader(file_path, **kwargs)
+         elif file_path.endswith('.pdf'):
+             self.loader = PyPDFLoader(file_path, **kwargs)
+         elif file_path.endswith('.csv'):
+             self.loader = CSVLoader(file_path, **kwargs)
+         else:
+             self.loader = TextLoader(file_path, **kwargs)
+
+     def load(self):
+         return self.loader.load()
+
+ custom_meta_data = [
+     {
+         "name": "Tanmay Chopra",
+         "profile": "https://www.linkedin.com/in/tanmayc98/",
+         "creationYear": "2023",
+         "topics": "Pinecone",
+     },
+     {
+         "name": "Neal Patel",
+         "profile": "https://www.linkedin.com/in/nealpatel112/",
+         "creationYear": "2023",
+         "topics": "Core - Model",
+     },
+     {
+         "name": "Navid",
+         "profile": "https://www.linkedin.com/in/Navid",
+         "creationYear": "2022",
+         "topics": "LLM",
+     },
+     {
+         "name": "Josua Krause",
+         "profile": "https://www.linkedin.com/in/Josua",
+         "creationYear": "2022",
+         "topics": "vector databases",
+     },
+     {
+         "name": "Jay Zhong",
+         "profile": "https://www.linkedin.com/in/Jay",
+         "creationYear": "2021",
+         "topics": "LLM",
+     },
+     {
+         "name": "Evan",
+         "profile": "https://www.linkedin.com/in/Evan",
+         "creationYear": "2021",
+         "topics": "OpenAI",
+     },
+     {
+         "name": "Siva_values",
+         "profile": "https://www.linkedin.com/Siva",
+         "creationYear": "2023",
+         "topics": "Personal goals"
+     },
+ ]
+
+ def ingestData():
+     os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
+     print("Loading data...")
+
+     embeddings = OpenAIEmbeddings()
+
+     if DB_TYPE == DBTypes['FAISS'].value or DB_TYPE == DBTypes['CHROMA'].value:
+         loader = DirectoryLoader(DATA_DIRECTORY, glob="**/*.*", loader_cls=MyLoader)
+         print("Loading directory")
+         docs = loader.load()
+
+         text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
+
+         enrichMetadata(docs)
+         print("splitting documents")
+         documents = text_splitter.split_documents(docs)
+         if DB_TYPE == DBTypes['FAISS'].value:
+             createFaissVectorstore(documents, embeddings)
+         elif DB_TYPE == DBTypes['CHROMA'].value:
+             createChromadb(documents, embeddings)
+     elif DB_TYPE == DBTypes['NOTION'].value:
+         loader = NotionDBLoader(
+             integration_token=NOTION_API_KEY,
+             database_id=NOTION_DB,
+             request_timeout_sec=30,  # optional, defaults to 10
+         )
+
+         documents = loader.load()
+         createChromaFromNotiondb(documents, embeddings)
+
+ #ingestData()
logs/output.log ADDED
@@ -0,0 +1,7 @@
+ Running on local URL: http://127.0.0.1:7860
+ Running on public URL: https://5ecacbc10380802821.gradio.live
+
+ This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
+ Keyboard interruption in main thread... closing server.
+ Killing tunnel 127.0.0.1:7860 <> https://5ecacbc10380802821.gradio.live
+ Keyboard interruption in main thread... closing server.
metadatainfo.py ADDED
@@ -0,0 +1,27 @@
+ from langchain.chains.query_constructor.base import AttributeInfo
+ metadata_field_info = [
+     AttributeInfo(
+         name="source",
+         description="Document path",
+         type="str",
+     ),
+     AttributeInfo(
+         name="name",
+         description="Name of the person",
+         type="str",
+     ),
+     AttributeInfo(
+         name="profile",
+         description="Linkedin profile",
+         type="str",
+     ),
+     AttributeInfo(
+         name="creationYear",
+         description="creation Year",
+         type="str",
+     ),
+     AttributeInfo(
+         name="topics",
+         description="The topics the person discussed",
+         type="str",
+     ),
+ ]
myvectorstore.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5d94de10425e826e311340ed98b6bc9c176cbbb650c87b14fdf610e697f23b84
+ size 122738
notionMetadataInfo.py ADDED
@@ -0,0 +1,23 @@
+ from langchain.chains.query_constructor.base import AttributeInfo
+ notion_metadata_field_info = [
+     AttributeInfo(
+         name="source",
+         description="source",
+         type="str",
+     ),
+     AttributeInfo(
+         name="name",
+         description="Name",
+         type="str",
+     ),
+     AttributeInfo(
+         name="id",
+         description="Id of the person",
+         type="str",
+     ),
+     AttributeInfo(
+         name="tags",
+         description="tags",
+         type="str",
+     ),
+ ]
notiondb/chroma.sqlite3 ADDED
Binary file (98.3 kB)
old/app_copy.py ADDED
@@ -0,0 +1,140 @@
+ import os
+ from typing import Optional, Tuple
+ from threading import Lock
+ from query_data import chain_options
+
+ import gradio as gr
+
+ from query_data import get_basic_qa_chain
+
+
+ def set_openai_api_key(api_key: str):
+     """Set the api key and return chain.
+     If no api_key, then None is returned.
+     """
+     os.environ["OPENAI_API_KEY"] = "sk-<redacted>"
+     if api_key:
+         # os.environ["OPENAI_API_KEY"] = api_key
+         chain = get_basic_qa_chain
+         # os.environ["OPENAI_API_KEY"] = ""
+         return chain
+
+ def chatFlag(message):
+     return message
+ '''
+ def getChainSelectedByUser() :
+     chain = get_basic_qa_chain
+
+     if (chainType == "get_qa_with_sources_chain" ):
+         chain = get_qa_with_sources_chain
+     elif (chainType == "get_custom_prompt_qa_chain"):
+         chain = get_custom_prompt_qa_chain
+     elif (chainType == "get_condense_prompt_qa_chain"):
+         chain = get_condense_prompt_qa_chain
+     elif (chainType == "get_retrievalqa_with_sources_chain"):
+         chain = get_retrievalqa_with_sources_chain
+
+     print("landed")
+     print("chainType" + chainType.value)
+
+     return chain
+ '''
+ class ChatWrapper:
+
+     def __init__(self):
+         self.lock = Lock()
+
+     def __call__(
+         self, api_key: str, inp: str, history: Optional[Tuple[str, str]], chain
+     ):
+         """Execute the chat functionality."""
+         self.lock.acquire()
+         try:
+             history = history or []
+             # If chain is None, that is because no API key was provided.
+             if chain is None:
+                 history.append((inp, "Please paste your OpenAI key to use"))
+                 return history, history
+             # Set OpenAI key
+             import openai
+             openai.api_key = api_key
+             # Run chain and append input.
+             output = chain({"question": inp})["answer"]
+             history.append((inp, output))
+         except Exception as e:
+             raise e
+         finally:
+             self.lock.release()
+         return history, history
+
+
+ chat = ChatWrapper()
+
+ block = gr.Blocks(css=".gradio-container {background-color: lightblue}",
+                   )
+
+ with block:
+     with gr.Row():
+         gr.Markdown(
+             "<h3><center>Chat-Your-Data</center></h3>")
+
+         openai_api_key_textbox = gr.Textbox(
+             placeholder="",
+             show_label=False,
+             lines=1,
+             type="password",
+         )
+     chatbot = gr.Chatbot()
+
+     with gr.Row():
+         message = gr.Textbox(
+             label="What's your question?",
+             placeholder="Ask questions about the uploaded documents",
+             lines=1,
+         )
+         submit = gr.Button(value="Send", variant="secondary").style(
+             full_width=False)
+
+     gr.Examples(
+         examples=[
+             "What did the president say about Ketanji Brown Jackson?",
+             "Did he mention Stephen Breyer?",
+             "What was his stance on Ukraine?",
+         ],
+         inputs=message,
+     )
+
+     with gr.Row():
+         chainType = gr.Dropdown(list(chain_options.keys()),
+                                 label="Chain Type",
+                                 value="get_retrievalqa_with_sources_chain",
+                                 )
+
+     gr.HTML("Mine your data by AI.")
+
+     gr.HTML(
+         "<center>Powered by <a href='https://github.com/hwchase17/langchain'>LangChain 🦜️🔗</a></center>"
+     )
+
+     state = gr.State()
+     agent_state = gr.State()
+
+     submit.click(chat, inputs=[openai_api_key_textbox, message,
+                                state, agent_state], outputs=[chatbot, state])
+     message.submit(chat, inputs=[
+         openai_api_key_textbox, message, state, agent_state], outputs=[chatbot, state])
+
+     openai_api_key_textbox.change(
+         set_openai_api_key,
+         inputs=[openai_api_key_textbox],
+         outputs=[agent_state],
+     )
+
+     # chainType.change(getChainSelectedByUser(), inputs=[chainType.value],
+     #                  outputs=[agent_state])
+
+ block.launch(debug=True)
query_data.py ADDED
@@ -0,0 +1,212 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pickle
2
+
3
+ from Constants import *
4
+ from langchain.chains import (ConversationalRetrievalChain, RetrievalQA,
5
+ RetrievalQAWithSourcesChain)
6
+ from langchain.chains.query_constructor.base import AttributeInfo
7
+ from langchain.chat_models import ChatOpenAI
8
+ from langchain.memory import ConversationBufferMemory
9
+ from langchain.prompts.chat import (ChatPromptTemplate,
10
+ HumanMessagePromptTemplate,
11
+ SystemMessagePromptTemplate)
12
+ from langchain.prompts.prompt import PromptTemplate
13
+ #from langchain.retrievers.self_query import BaseTranslator
14
+ from langchain.retrievers.self_query.base import SelfQueryRetriever
15
+ from langchain.chains.query_constructor.ir import Visitor
16
+ from langchain.vectorstores import Chroma
17
+ from langchain.vectorstores.base import VectorStoreRetriever
18
+ from metadatainfo import metadata_field_info
19
+ from notionMetadataInfo import notion_metadata_field_info
20
+ from langchain.embeddings import OpenAIEmbeddings
21
+ from typing import Any, List, Optional, Sequence, Union
22
+ import chromadb
23
+ from db_types import *
24
+
25
+ _template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.
26
+ You can assume the question about persons.
27
+
28
+ Chat History:
29
+ {chat_history}
30
+ Follow Up Input: {question}
31
+ Standalone question:"""
32
+ CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
33
+
34
+ template = """You are an AI assistant for answering questions about persons.
35
+ You are given the following extracted parts of a long document and a question. Provide a conversational answer.
36
+ If you don't know the answer, do not try to makeup the answer from other sources. If the answer is found quote the source of the answer as SOURCE:
37
+ Also include Topics in the answers as "TOPICS": Also include tags in the answers as "TAGS":
38
+ Question: {question}
39
+ =========
40
+ {context}
41
+ =========
42
+ Answer in Markdown:"""
43
+ QA_PROMPT = PromptTemplate(template=template, input_variables=[
44
+ "question", "context"])
45
+
46
+ class MyVisitor(Visitor) :
47
+
48
+ def visit_operation(self, op) -> Any:
49
+ print ("in operation")
50
+ return op
51
+ def visit_comparison(self, comparison) -> Any:
52
+ print("in comparison")
53
+ return comparison
54
+ def visit_structured_query(self, arg2) -> Any:
55
+ print("in structured query "+ arg2.query)
56
+ return self, arg2
57
+
58
+
59
+ def load_retriever():
60
+
61
+ retriever = VectorStoreRetriever(vectorstore=get_vectorstore(),dict=metadata_field_info)
62
+ return retriever
63
+
+ def get_vectorstore():
+     print("Reading from vectorstore " + DB_TYPE)
+     custom_meta_data_info = metadata_field_info  # schema for self-query filters; Notion overrides below
+     if DB_TYPE == DBTypes['FAISS'].value:
+         print("reading faiss vectorstore")
+         vectorstore_path = PERSIST_DIRECTORY + "myvectorstore.pkl"
+         with open(vectorstore_path, "rb") as f:
+             vectorstore = pickle.load(f)
+     elif DB_TYPE == DBTypes['NOTION'].value:
+         print("reading from Notion...")
+         custom_meta_data_info = notion_metadata_field_info
+         vectorstore = Chroma(persist_directory=NOTION_PERSIST_DIRECTORY,
+                              embedding_function=OpenAIEmbeddings(),
+                              collection_name=NOTION_COLLECTION_NAME)
+         print("Notion collection count : " + str(vectorstore._collection.count()))
+     else:
+         vectorstore = Chroma(persist_directory=CHROMA_PERSIST_DIRECTORY,
+                              collection_name=CHROMA_COLLECTION_NAME,
+                              embedding_function=OpenAIEmbeddings())
+         print("Chroma collection count : " + str(vectorstore._collection.count()))
+     return vectorstore
+
+ def get_basic_qa_chain():
+     llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+     retriever = load_retriever()
+     memory = ConversationBufferMemory(
+         memory_key="chat_history", return_messages=True)
+     model = ConversationalRetrievalChain.from_llm(
+         llm=llm,
+         retriever=retriever,
+         memory=memory,
+         verbose=True)
+     return model
+
+
+ def get_custom_prompt_qa_chain():
+     llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+     retriever = load_retriever()
+     memory = ConversationBufferMemory(
+         memory_key="chat_history", return_messages=True)
+     # see: https://github.com/langchain-ai/langchain/issues/6635
+     # see: https://github.com/langchain-ai/langchain/issues/1497
+     model = ConversationalRetrievalChain.from_llm(
+         llm=llm,
+         retriever=retriever,
+         memory=memory,
+         combine_docs_chain_kwargs={"prompt": QA_PROMPT})
+     return model
+
+
+ def get_condense_prompt_qa_chain():
+     llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+     retriever = load_retriever()
+     memory = ConversationBufferMemory(
+         memory_key="chat_history", return_messages=True)
+     # see: https://github.com/langchain-ai/langchain/issues/5890
+     model = ConversationalRetrievalChain.from_llm(
+         llm=llm,
+         retriever=retriever,
+         memory=memory,
+         condense_question_prompt=CONDENSE_QUESTION_PROMPT,
+         combine_docs_chain_kwargs={"prompt": QA_PROMPT})
+     return model
+
+ def get_retrievalqa_with_sources_chain():
+     system_template = """Use the following pieces of context to answer the user's question.
+     Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", using "SOURCES" in capital letters regardless of the number of sources.
+     Also include topics in the answer as "TOPICS", tags as "TAGS", and creationYear as "YEAR". If you don't know the answer, just say "I do not know"; don't try to make up an answer.
+     ----------------
+     {summaries}"""
+     messages = [
+         SystemMessagePromptTemplate.from_template(system_template),
+         HumanMessagePromptTemplate.from_template("{question}")
+     ]
+
+     prompt = ChatPromptTemplate.from_messages(messages)
+     chain_type_kwargs = {"prompt": prompt}
+
+     document_content_description = "Personal files"
+     llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+     vectorstore = get_vectorstore()
+     # The self-query retriever lets the LLM turn metadata mentions in the
+     # question into vector-store filters built from metadata_field_info.
+     retriever = SelfQueryRetriever.from_llm(
+         llm,
+         vectorstore,
+         document_content_description,
+         metadata_field_info,
+         # structured_query_translator=MyVisitor(),  # optional debugging translator
+         verbose=True,
+         enable_limit=True,
+     )
+
+     def model_func(question):
+         # app.py passes {"question": ...}; accept either a dict or a raw string.
+         query = question["question"] if isinstance(question, dict) else question
+         chain = RetrievalQAWithSourcesChain.from_chain_type(
+             llm,
+             chain_type="stuff",
+             retriever=retriever,
+             chain_type_kwargs=chain_type_kwargs,
+         )
+         return chain({"question": query})
+
+     return model_func
+
+ def get_qa_with_sources_chain():
+     llm = ChatOpenAI(model_name="gpt-4", temperature=0)
+     retriever = load_retriever()
+     history = []
+     model = ConversationalRetrievalChain.from_llm(
+         llm=llm,
+         retriever=retriever,
+         return_source_documents=True,
+         verbose=True)
+
+     def model_func(question):
+         # bug: return_source_documents doesn't work with the built-in memory,
+         # so the chat history is threaded through by hand instead.
+         # see: https://github.com/langchain-ai/langchain/issues/5630
+         new_input = {"question": question['question'], "chat_history": history}
+         result = model(new_input)
+         history.append((question['question'], result['answer']))
+         return result
+
+     return model_func
+
+
+ chain_options = {
+     "basic": get_basic_qa_chain,
+     "with_sources": get_qa_with_sources_chain,
+     "custom_prompt": get_custom_prompt_qa_chain,
+     "condense_prompt": get_condense_prompt_qa_chain,
+     "retrieval_sources_chain": get_retrievalqa_with_sources_chain,
+ }
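`chain_options` is the lookup table the UI uses to build a chain from a dropdown choice. A minimal sketch of driving it directly, outside Gradio (the question strings are illustrative):

```python
from query_data import chain_options

# "basic" returns a ConversationalRetrievalChain with its own memory.
chain = chain_options["basic"]()
result = chain({"question": "Who is mentioned most often in my notes?"})
print(result["answer"])

# "with_sources" returns a plain function that threads history by hand
# and also returns the retrieved documents.
sourced = chain_options["with_sources"]()
result = sourced({"question": "Where does that answer come from?"})
print(result["answer"])
print(result["source_documents"])
```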
read_notion.py ADDED
@@ -0,0 +1,47 @@
+ import requests
+ from apiKey import *
+ from Constants import *
+
+ NOTION_TOKEN = NOTION_API_KEY
+ DATABASE_ID = NOTION_DB
+
+ headers = {
+     "Authorization": "Bearer " + NOTION_TOKEN,
+     "Content-Type": "application/json",
+     "Notion-Version": "2022-06-28",
+ }
+
+ def get_pages(num_pages=None):
+     """
+     If num_pages is None, get all pages, otherwise just the defined number.
+     """
+     url = f"https://api.notion.com/v1/databases/{DATABASE_ID}/query"
+
+     get_all = num_pages is None
+     # MAX_PAGES_TO_READ is not defined in Constants.py, so fall back to
+     # Notion's maximum page_size of 100 when fetching everything.
+     page_size = 100 if get_all else num_pages
+
+     payload = {"page_size": page_size}
+     response = requests.post(url, json=payload, headers=headers)
+     data = response.json()
+
+     # Uncomment this to dump all data to a file:
+     # import json
+     # with open('db.json', 'w', encoding='utf8') as f:
+     #     json.dump(data, f, ensure_ascii=False, indent=4)
+
+     results = data["results"]
+     # Notion paginates: replay next_cursor as start_cursor until has_more is False.
+     while data["has_more"] and get_all:
+         payload = {"page_size": page_size, "start_cursor": data["next_cursor"]}
+         response = requests.post(url, json=payload, headers=headers)
+         data = response.json()
+         results.extend(data["results"])
+
+     return results
+
+ #get_pages()
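`get_pages` follows Notion's cursor pagination: each response carries `has_more` and `next_cursor`, and the cursor is replayed as `start_cursor` until the database is exhausted. A minimal sketch of calling it (only the page `id` field is assumed here, since the database schema isn't shown):

```python
from read_notion import get_pages

pages = get_pages()  # all pages, 100 per request
print(f"fetched {len(pages)} pages")

first_ten = get_pages(num_pages=10)  # single request, first 10 pages only
for page in first_ten:
    # Each result is a full Notion page object; its properties depend
    # on the database schema, so only the id is printed here.
    print(page["id"])
```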
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ langchain
+ openai
+ faiss-cpu
+ unstructured
+ tiktoken
+ rich  # for console formatting
+ gradio
+ chromadb  # Chroma vector store used by query_data.py
+ requests  # Notion API calls in read_notion.py
utilities.py ADDED
@@ -0,0 +1,31 @@
+ from enum import Enum
+ from typing import List, Tuple, Type
+
+ import numpy as np
+
+ from langchain.docstore.document import Document
+
+
+ def transform_complex_metadata(
+     documents: List[Document],
+     *,
+     allowed_types: Tuple[Type, ...] = (str, bool, int, float)
+ ) -> List[Document]:
+     """Filter out metadata types that are not supported for a vector store."""
+     updated_documents = []
+     for document in documents:
+         transformed_metadata = {}
+         for key, value in document.metadata.items():
+             if isinstance(value, allowed_types):
+                 transformed_metadata[key] = value
+             elif isinstance(value, list):
+                 # Flatten lists to a comma-joined string; str() guards
+                 # against non-string list items.
+                 transformed_metadata[key] = ','.join(str(v) for v in value)
+             # any other unsupported type is silently dropped
+         document.metadata = transformed_metadata
+         updated_documents.append(document)
+
+     return updated_documents
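`transform_complex_metadata` exists because Chroma only accepts scalar metadata values: lists are flattened into comma-joined strings and anything else is dropped. A short sketch of the behavior on a made-up document:

```python
from langchain.docstore.document import Document

from utilities import transform_complex_metadata

doc = Document(
    page_content="example",
    metadata={"tags": ["family", "travel"], "year": 2023, "extra": {"a": 1}},
)

cleaned = transform_complex_metadata([doc])[0]
# the list is joined, the int kept, the dict silently dropped:
# {'tags': 'family,travel', 'year': 2023}
print(cleaned.metadata)
```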
vectorstore.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00e5530b85a9588de9a81eb6e60c633d21774b26e8a8fcb0b6be85dfa95167f5
+ size 523469