xke committed on
Commit
1aa8590
·
1 Parent(s): 2a5ba6e

init without binary files

Files changed (6)
  1. Dockerfile +11 -0
  2. README.md +60 -7
  3. app.py +124 -0
  4. chainlit.md +14 -0
  5. requirements.txt +8 -0
  6. screenshot.png +0 -0
Dockerfile ADDED
@@ -0,0 +1,11 @@
+ FROM python:3.11
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+ WORKDIR $HOME/app
+ COPY --chown=user . $HOME/app
+ COPY ./requirements.txt $HOME/app/requirements.txt
+ RUN pip install -r requirements.txt
+ COPY . .
+ CMD ["chainlit", "run", "app.py", "--port", "7860"]
README.md CHANGED
@@ -1,10 +1,63 @@
  ---
- title: Chroma Qa Chat
- emoji: 🏒
- colorFrom: green
- colorTo: green
- sdk: docker
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: 'Chroma Q&A with Sources Element'
+ tags: ['chroma', 'chainlit', 'qa']
  ---

+ # Chroma Q&A with Sources Element
+
+ This repository contains a Chainlit application that provides a question-answering service using documents stored in a Chroma vector store. It allows users to upload PDF documents, which are then chunked, embedded, and indexed for efficient retrieval. When a user asks a question, the application retrieves relevant document chunks and uses OpenAI's language model to generate an answer, citing the sources it used.
+
+ ## High-Level Description
+
+ The `app.py` script performs the following functions:
+
+ 1. **PDF Processing (`process_pdfs`)**: Chunks PDF files into smaller text segments, creates embeddings for each chunk, and stores them in Chroma.
+ 2. **Document Indexing (`index`)**: Uses `SQLRecordManager` to track document writes into the vector store.
+ 3. **Question Answering (`on_message`)**: When a user asks a question, the application retrieves relevant document chunks and generates an answer using OpenAI's language model, providing the sources for transparency.
+
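The chunk-and-overlap idea behind step 1 can be illustrated with a plain-Python sketch. The app itself uses LangChain's `RecursiveCharacterTextSplitter`; `chunk_text` below is a hypothetical stand-in that only shows why overlapping chunks help:

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks. Consecutive chunks share
    chunk_overlap characters, so a sentence cut at one chunk boundary
    still appears whole in the neighboring chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each resulting chunk is then embedded and written to the vector store, keyed by its source document so the indexer can deduplicate on re-runs.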
+ ## Quickstart
+
+ ### Prerequisites
+
+ - Python 3.11 or higher
+ - Chainlit installed
+ - PDF documents to be indexed
+
+ ### Setup and Run
+
+ 1. **Install Dependencies:**
+
+    Install the required Python packages specified in `requirements.txt`.
+
+    ```shell
+    pip install -r requirements.txt
+    ```
+
+ 2. **Process PDFs:**
+
+    Place your PDF documents in the `./pdfs` directory.
+
+ 3. **Run the Application:**
+
+    Use the provided `Dockerfile` to build and run the application.
+
+    ```shell
+    docker build -t chroma-qa-chat .
+    docker run -p 7860:7860 chroma-qa-chat
+    ```
+
+    Access the application at `http://localhost:7860`.
+
+ ## Code Definitions
+
+ - `process_pdfs`: Function that processes PDF files and indexes them into Chroma.
+ - `on_chat_start`: Event handler that sets up the Chainlit session with the necessary components for question answering.
+ - `on_message`: Event handler that processes user messages, retrieves relevant information, and sends back an answer.
+ - `PostMessageHandler`: Callback handler that posts the sources of the retrieved documents as a Chainlit element.
+
+ ![Screenshot](./screenshot.png)
+
+ ## See Also
+
+ For a visual guide on how to use this application, watch the video by [Chris Alexiuk](https://www.youtube.com/watch?v=9SBUStfCtmk&ab_channel=ChrisAlexiuk).
app.py ADDED
@@ -0,0 +1,124 @@
+ from typing import List
+ from pathlib import Path
+ from langchain_openai import ChatOpenAI, OpenAIEmbeddings
+ from langchain.prompts import ChatPromptTemplate
+ from langchain.schema import StrOutputParser
+ from langchain_community.document_loaders import (
+     PyMuPDFLoader,
+ )
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain.vectorstores.chroma import Chroma
+ from langchain.indexes import SQLRecordManager, index
+ from langchain.schema import Document
+ from langchain.schema.runnable import Runnable, RunnablePassthrough, RunnableConfig
+ from langchain.callbacks.base import BaseCallbackHandler
+
+ import chainlit as cl
+
+ chunk_size = 1024
+ chunk_overlap = 50
+
+ embeddings_model = OpenAIEmbeddings()
+
+ PDF_STORAGE_PATH = "./pdfs"
+
+
+ def process_pdfs(pdf_storage_path: str):
+     pdf_directory = Path(pdf_storage_path)
+     docs = []  # type: List[Document]
+     text_splitter = RecursiveCharacterTextSplitter(
+         chunk_size=chunk_size, chunk_overlap=chunk_overlap
+     )
+
+     for pdf_path in pdf_directory.glob("*.pdf"):
+         loader = PyMuPDFLoader(str(pdf_path))
+         documents = loader.load()
+         docs += text_splitter.split_documents(documents)
+
+     doc_search = Chroma.from_documents(docs, embeddings_model)
+
+     namespace = "chromadb/my_documents"
+     record_manager = SQLRecordManager(
+         namespace, db_url="sqlite:///record_manager_cache.sql"
+     )
+     record_manager.create_schema()
+
+     index_result = index(
+         docs,
+         record_manager,
+         doc_search,
+         cleanup="incremental",
+         source_id_key="source",
+     )
+
+     print(f"Indexing stats: {index_result}")
+
+     return doc_search
+
+
+ doc_search = process_pdfs(PDF_STORAGE_PATH)
+ model = ChatOpenAI(model_name="gpt-4", streaming=True)
+
+
+ @cl.on_chat_start
+ async def on_chat_start():
+     template = """Answer the question based only on the following context:
+
+ {context}
+
+ Question: {question}
+ """
+     prompt = ChatPromptTemplate.from_template(template)
+
+     def format_docs(docs):
+         return "\n\n".join([d.page_content for d in docs])
+
+     retriever = doc_search.as_retriever()
+
+     runnable = (
+         {"context": retriever | format_docs, "question": RunnablePassthrough()}
+         | prompt
+         | model
+         | StrOutputParser()
+     )
+
+     cl.user_session.set("runnable", runnable)
+
+
+ @cl.on_message
+ async def on_message(message: cl.Message):
+     runnable = cl.user_session.get("runnable")  # type: Runnable
+     msg = cl.Message(content="")
+
+     class PostMessageHandler(BaseCallbackHandler):
+         """
+         Callback handler for the retriever and LLM runs.
+         Posts the sources of the retrieved documents as a Chainlit element.
+         """
+
+         def __init__(self, msg: cl.Message):
+             BaseCallbackHandler.__init__(self)
+             self.msg = msg
+             self.sources = set()  # to store unique (source, page) pairs
+
+         def on_retriever_end(self, documents, *, run_id, parent_run_id, **kwargs):
+             for d in documents:
+                 source_page_pair = (d.metadata["source"], d.metadata["page"])
+                 self.sources.add(source_page_pair)
+
+         def on_llm_end(self, response, *, run_id, parent_run_id, **kwargs):
+             if len(self.sources):
+                 sources_text = "\n".join(
+                     [f"{source}#page={page}" for source, page in self.sources]
+                 )
+                 self.msg.elements.append(
+                     cl.Text(name="Sources", content=sources_text, display="inline")
+                 )
+
+     async with cl.Step(type="run", name="QA Assistant"):
+         async for chunk in runnable.astream(
+             message.content,
+             config=RunnableConfig(
+                 callbacks=[
+                     cl.LangchainCallbackHandler(),
+                     PostMessageHandler(msg),
+                 ]
+             ),
+         ):
+             await msg.stream_token(chunk)
+
+     await msg.send()
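The deduplication that `PostMessageHandler` performs can be seen in isolation with a small, dependency-free sketch. `collect_sources` below is a toy helper, not part of the app; it stands in for the `on_retriever_end`/`on_llm_end` pair and represents documents as plain dicts instead of LangChain `Document` objects:

```python
def collect_sources(documents: list[dict]) -> str:
    """Collect unique (source, page) pairs from retrieved documents and
    render them one per line, mirroring how PostMessageHandler builds
    the text of the "Sources" element."""
    sources = {(d["source"], d["page"]) for d in documents}
    return "\n".join(f"{source}#page={page}" for source, page in sorted(sources))
```

Sorting here only makes the toy output deterministic for illustration; the handler itself iterates its set directly, so the order of its lines is unspecified.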
chainlit.md ADDED
@@ -0,0 +1,14 @@
+ # Welcome to Chainlit! 🚀🤖
+
+ Hi there, Developer! 👋 We're excited to have you on board. Chainlit is a powerful tool designed to help you prototype, debug and share applications built on top of LLMs.
+
+ ## Useful Links 🔗
+
+ - **Documentation:** Get started with our comprehensive [Chainlit Documentation](https://docs.chainlit.io) 📚
+ - **Discord Community:** Join our friendly [Chainlit Discord](https://discord.gg/k73SQ3FyUh) to ask questions, share your projects, and connect with other developers! 💬
+
+ We can't wait to see what you create with Chainlit! Happy coding! 💻😊
+
+ ## Welcome screen
+
+ To modify the welcome screen, edit the `chainlit.md` file at the root of your project. If you do not want a welcome screen, just leave this file empty.
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ langchain
+ langchain-community
+ chainlit
+ langchain_openai
+ openai
+ chromadb
+ tiktoken
+ pymupdf
screenshot.png ADDED