---
title: CSRD GPT
emoji: 🌿
colorFrom: blue
colorTo: green
sdk: gradio
python_version: 3.10.0
sdk_version: 3.22.1
app_file: app.py
pinned: true
---
## Introduction

This app runs on Python 3.10.0.

## Built With

- [Gradio](https://www.gradio.app/docs/interface) - Main server and interactive components
- [OpenAI API](https://platform.openai.com/docs/api-reference) - Main LLM engine used in the app
- [HuggingFace Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) - Used as the default embedding model

## Requirements

> **_NOTE:_** Before installing the requirements, rename the file `.env.example` to `.env` and put your OpenAI API key there!

We suggest creating a separate virtual environment running Python 3 for this app and installing all of the required dependencies there. Run in Terminal/Command Prompt:

```bash
git clone https://github.com/Nexialog/RegGPT.git
cd RegGPT/
python -m venv venv
```

On UNIX systems:

```bash
source venv/bin/activate
```

On Windows:

```bash
venv\Scripts\activate
```

To install all of the required packages in this environment, simply run:

```bash
pip install -r requirements.txt
```

All of the required `pip` packages will be installed, and the app will be ready to run.

## Usage of run_script.py

This script processes PDF documents and generates text embeddings. You can specify different modes and parameters via command-line arguments.

### Process Documents

To process PDF documents and extract paragraphs and metadata, use the following command:

```bash
python run_script.py --type process_documents
```

You can also use optional arguments to specify the folder containing PDFs, the output data folder, minimum paragraph length, and merge length.

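As an illustrative sketch (not the project's actual code) of how the minimum length and merge length might interact, paragraphs shorter than the minimum can be concatenated with their successors until the merged block is long enough:

```python
def merge_paragraphs(paragraphs, min_length=300, merge_length=700):
    """Hypothetical sketch: merge short paragraphs with the following ones.

    Paragraphs are accumulated into a buffer; the buffer is flushed once
    it reaches ``min_length`` (or would exceed ``merge_length``).
    """
    merged = []
    buffer = ""
    for p in paragraphs:
        buffer = f"{buffer} {p}".strip() if buffer else p
        # Flush once the buffer is long enough or hits the merge cap.
        if len(buffer) >= min_length or len(buffer) >= merge_length:
            merged.append(buffer)
            buffer = ""
    if buffer:  # keep any trailing short paragraph
        merged.append(buffer)
    return merged
```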
### Generate Embeddings

To generate text embeddings from the processed paragraphs, use the following command:

```bash
python run_script.py --type generate_embeddings
```

This command will use the default embedding model, but you can specify another model using the `--embedding_model` argument.

### Process Documents and Generate Embeddings

To perform both document processing and embedding generation, use:

```bash
python run_script.py --type all
```

### Command Line Arguments

- `--type`: Specifies the operation type. Choices are `all`, `process_documents`, or `generate_embeddings`. (required)
- `--pdf_folder`: Path to the folder containing PDF documents. Default is `pdf_data/`. (optional)
- `--data_folder`: Path to the folder where processed data and embeddings will be saved. Default is `data/`. (optional)
- `--embedding_model`: Specifies the model to be used for generating embeddings. Default is `sentence-transformers/multi-qa-mpnet-base-dot-v1`. (optional)
- `--device`: Specifies the device to be used (CPU or GPU). Choices are `cpu` or `cuda`. Default is `cpu`. (optional)
- `--min_length`: Specifies the minimum paragraph length for inclusion. Default is `300`. (optional)
- `--merge_length`: Specifies the merge length for paragraphs. Default is `700`. (optional)

### Examples

```bash
python run_script.py --type process_documents --pdf_folder my_pdf_folder/ --merge_length 800
```

```bash
python run_script.py --type generate_embeddings --device cuda
```

### How to use Colab's GPU

1. Create your own [deploy key on GitHub](https://github.com/Nexialog/RegGPT/settings/keys)
2. Upload the key to Google Drive at the path: `drive/MyDrive/ssh_key_github/`
3. Upload the notebook `notebooks/generate_embeddings.ipynb` into a Colab session (or use this [link](https://colab.research.google.com/drive/1E7uHJF7gH_36O9ylIgWhiAjHpRJRyvnv?usp=sharing))
4. Upload the PDF files to the same Colab session at the path: `pdf_data/`
5. Run the notebook in GPU mode and download the folder `data/` containing embeddings and chunks

## How to Configure a New BOT

1. Put all PDF files in a folder in the same repository (we recommend the folder name `pdf_data`)
2. Run the Python script `run_script.py` as explained above
3. Configure the BOT in the config by following the steps below:

To configure the chatbot, you need to modify the `config.py` file that contains the `CFG_APP` class. Here's what each attribute in the class means:

### Basic Settings

- `DEBUG`: Debugging mode
- `K_TOTAL`: The total number of retrieved documents
- `THRESHOLD`: Threshold for retrieval by embeddings
- `DEVICE`: Device for computation
- `BOT_NAME`: The name of the bot
- `MODEL_NAME`: The name of the model

### Language and Data

- `DEFAULT_LANGUAGE`: Default language
- `DATA_FOLDER`: Path to the data folder
- `EMBEDDING_MODEL`: Embedding model

### Tokens and Prompts

- `MAX_TOKENS_REF_QUESTION`: Maximum tokens in the reformulated question
- `MAX_TOKENS_ANSWER`: Maximum tokens in answers
- `INIT_PROMPT`: Initial prompt
- `SOURCES_PROMPT`: Sources prompt for responses

### Default Questions

- `DEFAULT_QUESTIONS`: Tuple of default questions

### Reformulation Prompt

- `REFORMULATION_PROMPT`: Prompt for reformulating questions

### Metadata Path

- `DOC_METADATA_PATH`: Path to document metadata

## How to Use This BOT

Run this app locally with:

```bash
python app.py
```

Open [http://127.0.0.1:7860](http://127.0.0.1:7860) in your browser, and you will see the bot.