---
title: "Paper-based RAG"
emoji: "๐"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
  - sentence-transformers/all-mpnet-base-v2
tags:
  - question-answering
  - gradio
  - LLM
  - document-processing
---
# Document QA System

A document question-answering system that uses LlamaIndex for document indexing, retrieval, and generation, and Gradio for the user interface.
## Technologies

- Data source
  - A [paper about BERT](https://arxiv.org/pdf/1810.04805) located in the `data` directory is used as the data source for indexing.
- Chunking
  - Document chunking is handled with the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) embedding model.
- LLM
  - The system uses `gpt-4o-mini` to generate responses.
- Retriever and reranker
  - `gpt-4o-mini` is used for reranking retrieved chunks.
- UI
  - The user interface is built with Gradio.
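As a rough illustration of embedding-based chunking (a hypothetical sketch, not this project's actual implementation): consecutive sentences whose embeddings are similar are grouped into one chunk, and a new chunk starts when similarity drops below a threshold. The bag-of-words `toy_embed` here is a toy stand-in for a real model such as all-mpnet-base-v2.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    # Group consecutive sentences into chunks; start a new chunk when
    # the next sentence's embedding diverges from the previous one.
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) >= threshold:
            current.append(cur)
        else:
            chunks.append(" ".join(current))
            current = [cur]
    chunks.append(" ".join(current))
    return chunks

# Toy embedding: word counts over a tiny vocabulary (a stand-in for
# a real sentence-transformers model).
VOCAB = ["bert", "transformer", "gradio", "interface"]
def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

print(semantic_chunks(
    ["BERT is a transformer model",
     "The transformer pre-trains BERT",
     "Gradio renders the interface"],
    toy_embed,
))
```

With this toy embedding, the two BERT sentences end up in one chunk and the Gradio sentence starts a second one.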
## Installation

### Prerequisites
1. **Docker**
   - [Install Docker](https://docs.docker.com/get-docker/)
2. **API keys**
   - [OpenAI](https://platform.openai.com/api-keys)
   - [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)
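The keys are usually supplied via environment variables. Assuming the project reads the conventional `OPENAI_API_KEY` and `LLAMA_CLOUD_API_KEY` names (check `utils/settings.py` for the exact variables it expects):

```shell
export OPENAI_API_KEY="sk-..."       # from platform.openai.com
export LLAMA_CLOUD_API_KEY="llx-..." # from cloud.llamaindex.ai
```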
### Using HuggingFace Spaces
1. Follow the link to the [paper-based-rag](https://huggingface.co/spaces/Gepe55o/paper_based_rag) Space.
2. Upload your paper for indexing, or use the default [paper](https://arxiv.org/pdf/1810.04805) about BERT.
### Using Docker
1. **Build the Docker image**:
   ```bash
   docker build -t doc-qa-system .
   ```
2. **Run the Docker container**:
   ```bash
   docker run -p 7860:7860 doc-qa-system
   ```
3. **Access the interface**:
   - Open your browser and go to `http://localhost:7860`.
### Using Python
1. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
2. **Add a paper to the data directory**:
   - Add the paper you want to index to the `data` directory, or use the default [paper](https://arxiv.org/pdf/1810.04805) about BERT.
3. **Index the data**:
   ```bash
   python index.py
   ```
4. **Run the application**:
   ```bash
   python app.py
   ```
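Conceptually, answering a query means embedding it, retrieving the most similar indexed chunks, and handing them to the LLM as context. A minimal stand-in sketch of that flow (hypothetical names; the toy `retrieve` and `generate` stubs replace the real LlamaIndex retriever and the `gpt-4o-mini` call):

```python
def retrieve(query, chunks, top_k=2):
    # Rank indexed chunks by naive word overlap with the query
    # (a stand-in for vector similarity search over the index).
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:top_k]

def generate(query, context):
    # Stand-in for the LLM call: a real system would send the query
    # plus retrieved context to gpt-4o-mini and return its completion.
    return f"Q: {query}\nContext: {' | '.join(context)}"

chunks = [
    "BERT is pre-trained with masked language modeling.",
    "Gradio builds the web interface.",
    "Fine-tuning adapts BERT to downstream tasks.",
]
query = "How is BERT pre-trained?"
answer = generate(query, retrieve(query, chunks))
print(answer)
```

The real pipeline swaps the overlap scorer for vector search over the stored index and the stub for an actual LLM request.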
## Project structure
```bash
├── app.py              # Gradio application
├── main.py             # Main script for answering queries
├── utils/              # Utility functions and helpers
│   ├── constant.py     # Constant values used in the project
│   ├── index.py        # Handles document indexing
│   ├── retriever.py    # Retrieves and ranks documents
│   └── settings.py     # Configuration settings
├── data/               # Directory containing documents to be indexed
├── index/              # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   └── index_store.json
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker configuration
└── README.md           # Project documentation
```
## Example questions
- What is the pre-training procedure for BERT, and how does it differ from traditional supervised learning?
- Can you describe how BERT can be fine-tuned for tasks like question answering or sentiment analysis?