File size: 2,313 Bytes
27a8994
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
## Backend

This folder contains the backend services for the Document Chat App.

### `app.py`
This file is the main entry point for the Streamlit web application. It handles the user interface, chat history management, and interacts with the `agent` to process user queries and generate responses.

### `main.py`
This script is responsible for processing documents. It loads and extracts data (tables, texts, images) from PDF files in the `data` directory, summarizes them using the `summerizer` module, and then chunks and adds the processed documents to the vector store. It keeps track of processed files in `processed_files.txt` to avoid reprocessing.

### `data/`
This directory is intended to store the raw PDF documents that need to be processed by the system.

### `vectorStore/`
This directory stores the generated vector embeddings of the processed documents. These embeddings are used by the `agent` for retrieving relevant information during the chat.

### `agent/`
This module contains the logic for the conversational agent, which uses the vector store to answer questions based on the processed documents.

### `summerizer/`
This module provides functionalities for summarizing different types of content (text, images) extracted from the documents.

### `utils/`
This module contains utility functions, such as `helper.py` for loading and extracting data from documents.

### `tool/`
This module likely contains tools or functions used by the agent to perform specific tasks.

### `generator.py`
This file likely contains code related to generating responses or content within the application.

### How to Run
To run the backend application, you will typically run `app.py` using Streamlit after ensuring all dependencies are installed and documents are processed by `main.py`.

```bash

streamlit run app.py

```

### Running `main.py` for Document Embedding
To process and embed documents, run the `main.py` script. This script will load PDF files from the `data` directory, extract and summarize their contents, and then add them to the vector store.

```bash

python main.py

```

Make sure that the `data` directory contains the PDF files you want to process. The script will log processed files in `processed_files.txt` to avoid reprocessing them.