---
title: CSRD GPT
emoji: 🌿
colorFrom: blue
colorTo: green
sdk: gradio
python_version: 3.10.0
sdk_version: 3.22.1
app_file: app.py
pinned: true
---
## Introduction

This app runs on Python 3.10.0.

## Built With

- [Gradio](https://www.gradio.app/docs/interface) - Main server and interactive components
- [OpenAI API](https://platform.openai.com/docs/api-reference) - Main LLM engine used in the app
- [HuggingFace Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) - Used as the default embedding model

## Requirements

> **_NOTE:_** Before installing the requirements, rename the file `.env.example` to `.env` and put your OpenAI API key there!

We suggest creating a separate virtual environment running Python 3 for this app and installing all of the required dependencies there. Run in Terminal/Command Prompt:

```bash
git clone https://github.com/Nexialog/RegGPT.git
cd RegGPT/
python -m venv venv
```

On UNIX systems:

```bash
source venv/bin/activate
```

On Windows:

```bash
venv\Scripts\activate
```

To install all of the required packages in this environment, simply run:

```bash
pip install -r requirements.txt
```

All of the required `pip` packages will be installed, and the app will be ready to run.

## Usage of run_script.py

This script processes PDF documents and generates text embeddings. You can specify different modes and parameters via command-line arguments.

### Process Documents

To process PDF documents and extract paragraphs and metadata, use the following command:

```bash
python run_script.py --type process_documents
```

You can also use optional arguments to specify the folder containing PDFs, the output data folder, minimum paragraph length, and merge length.

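As an illustrative sketch (not the project's actual code) of how the minimum length and merge length might interact, paragraphs shorter than the minimum can be concatenated with their successors until the merged block is long enough:

```python
def merge_paragraphs(paragraphs, min_length=300, merge_length=700):
    """Hypothetical sketch: merge short paragraphs with the following ones.

    Paragraphs are accumulated into a buffer; the buffer is flushed once
    it reaches ``min_length`` (or would exceed ``merge_length``).
    """
    merged = []
    buffer = ""
    for p in paragraphs:
        buffer = f"{buffer} {p}".strip() if buffer else p
        # Flush once the buffer is long enough or hits the merge cap.
        if len(buffer) >= min_length or len(buffer) >= merge_length:
            merged.append(buffer)
            buffer = ""
    if buffer:  # keep any trailing short paragraph
        merged.append(buffer)
    return merged
```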
### Generate Embeddings

To generate text embeddings from the processed paragraphs, use the following command:

```bash
python run_script.py --type generate_embeddings
```

This command will use the default embedding model, but you can specify another model using the `--embedding_model` argument.

### Process Documents and Generate Embeddings

To perform both document processing and embedding generation, use:

```bash
python run_script.py --type all
```

### Command Line Arguments

- `--type`: Specifies the operation type. Choices are `all`, `process_documents`, or `generate_embeddings`. (required)
- `--pdf_folder`: Path to the folder containing PDF documents. Default is `pdf_data/`. (optional)
- `--data_folder`: Path to the folder where processed data and embeddings will be saved. Default is `data/`. (optional)
- `--embedding_model`: Specifies the model to be used for generating embeddings. Default is `sentence-transformers/multi-qa-mpnet-base-dot-v1`. (optional)
- `--device`: Specifies the device to be used (CPU or GPU). Choices are `cpu` or `cuda`. Default is `cpu`. (optional)
- `--min_length`: Specifies the minimum paragraph length for inclusion. Default is `300`. (optional)
- `--merge_length`: Specifies the merge length for paragraphs. Default is `700`. (optional)

### Examples

```bash
python run_script.py --type process_documents --pdf_folder my_pdf_folder/ --merge_length 800
```

```bash
python run_script.py --type generate_embeddings --device cuda
```

### How to use Colab's GPU

1. Create your own [deploy key on GitHub](https://github.com/Nexialog/RegGPT/settings/keys)
2. Upload the key to Google Drive at the path: `drive/MyDrive/ssh_key_github/`
3. Upload the notebook `notebooks/generate_embeddings.ipynb` into a Colab session (or use this [link](https://colab.research.google.com/drive/1E7uHJF7gH_36O9ylIgWhiAjHpRJRyvnv?usp=sharing))
4. Upload the PDF files to the same Colab session at the path: `pdf_data/`
5. Run the notebook in GPU mode and download the folder `data/` containing embeddings and chunks

## How to Configure a New BOT

1. Put all PDF files in a folder in the same repository (we recommend the folder name `pdf_data`)
2. Run the Python script `run_script.py` as explained above
3. Configure the BOT in the config by following the steps below:

To configure the chatbot, you need to modify the `config.py` file that contains the `CFG_APP` class. Here's what each attribute in the class means:

### Basic Settings

- `DEBUG`: Debugging mode
- `K_TOTAL`: The total number of retrieved documents
- `THRESHOLD`: Threshold for retrieval by embeddings
- `DEVICE`: Device for computation
- `BOT_NAME`: The name of the bot
- `MODEL_NAME`: The name of the model

### Language and Data

- `DEFAULT_LANGUAGE`: Default language
- `DATA_FOLDER`: Path to the data folder
- `EMBEDDING_MODEL`: Embedding model

### Tokens and Prompts

- `MAX_TOKENS_REF_QUESTION`: Maximum tokens in the reformulated question
- `MAX_TOKENS_ANSWER`: Maximum tokens in answers
- `INIT_PROMPT`: Initial prompt
- `SOURCES_PROMPT`: Sources prompt for responses

### Default Questions

- `DEFAULT_QUESTIONS`: Tuple of default questions

### Reformulation Prompt

- `REFORMULATION_PROMPT`: Prompt for reformulating questions

### Metadata Path

- `DOC_METADATA_PATH`: Path to document metadata

## How to Use This BOT

Run this app locally with:

```bash
python app.py
```

Open [http://127.0.0.1:7860](http://127.0.0.1:7860) in your browser, and you will see the bot.