Update README.md
Browse files
README.md
CHANGED
|
@@ -9,50 +9,30 @@ app_file: app.py
|
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
|
| 12 |
-
# BigQA
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
##
|
| 17 |
|
| 18 |
-
|
| 19 |
-
- **Streamlit Interface**: User-friendly web interface.
|
| 20 |
-
- **Vector Search**: Uses HuggingFace embeddings (`all-MiniLM-L6-v2`) and an in-memory vector store.
|
| 21 |
-
- **LLM Integration**: Connects to OpenRouter to use the `openai/gpt-oss-20b` model.
|
| 22 |
-
- **Dataset**: Automatically loads and indexes the `Ono-Enzo/Dataset_test` dataset from Hugging Face.
|
| 23 |
|
| 24 |
-
|
|
|
|
|
|
|
|
|
|
| 25 |
|
| 26 |
-
|
| 27 |
-
- An API Key from OpenRouter
|
| 28 |
|
| 29 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
|
| 31 |
-
|
| 32 |
-
```bash
|
| 33 |
-
git clone <repository-url>
|
| 34 |
-
cd RAG_BigQA
|
| 35 |
-
```
|
| 36 |
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
## Configuration
|
| 43 |
-
|
| 44 |
-
1. Create a `.env` file in the root directory of the project.
|
| 45 |
-
2. Add your OpenRouter API key to the file:
|
| 46 |
-
```env
|
| 47 |
-
OPENROUTER_API_KEY=your_api_key_here
|
| 48 |
-
```
|
| 49 |
-
|
| 50 |
-
## Usage
|
| 51 |
-
|
| 52 |
-
Run the Streamlit application:
|
| 53 |
-
|
| 54 |
-
```bash
|
| 55 |
-
streamlit run app.py
|
| 56 |
-
```
|
| 57 |
-
|
| 58 |
-
The application will open in your default web browser. It will automatically download the dataset, generate embeddings, and prepare the LLM. Once ready, you can type your questions in the input field.
|
|
|
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# π BigQA β Retrieval-Augmented Generation
|
| 13 |
|
| 14 |
+
BigQA is a software architecture designed for querying large volumes of textual data. This application implements a **Retrieval-Augmented Generation (RAG)** pipeline, combining semantic document retrieval with Large Language Models (LLMs) to provide precise, context-aware answers.
|
| 15 |
|
| 16 |
+
## π Scientific Foundation
|
| 17 |
|
| 18 |
+
This implementation is based on the reference architecture proposed in the following research papers:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
+
* **Design Principles and a Software Reference Architecture for Big Data Question Answering Systems (2023)**.
|
| 21 |
+
[Access Paper](https://www.scitepress.org/Link.aspx?doi=10.5220/0011842700003467)
|
| 22 |
+
* **BigQA: A Software Reference Architecture for Big Data Question Answering Systems (2024)**.
|
| 23 |
+
[Access Paper](https://link.springer.com/chapter/10.1007/978-3-031-64748-2_3)
|
| 24 |
|
| 25 |
+
## π Features
|
|
|
|
| 26 |
|
| 27 |
+
- **RAG Architecture**: Full integration between document retrieval and generative AI.
|
| 28 |
+
- **Vector Search**: Uses HuggingFace embeddings (`all-MiniLM-L6-v2`) for semantic similarity search.
|
| 29 |
+
- **Streamlit Interface**: An intuitive and responsive web interface for real-time querying.
|
| 30 |
+
- **LLM Integration**: Connected via OpenRouter to access state-of-the-art models (e.g., Qwen, Gemini, GPT).
|
| 31 |
+
- **Automated Indexing**: Automatic loading and processing of the `Ono-Enzo/Dataset_test` dataset.
|
| 32 |
|
| 33 |
+
## π οΈ Tech Stack
|
|
|
|
|
|
|
|
|
|
|
|
|
| 34 |
|
| 35 |
+
- **LangChain**: Framework for orchestrating the AI logic.
|
| 36 |
+
- **Streamlit**: For the user interface.
|
| 37 |
+
- **OpenRouter**: Gateway for LLM access.
|
| 38 |
+
- **Hugging Face Datasets & Embeddings**: For data management and vectorization.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|