|
|
---
title: beta-NORM
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
# beta-NORM: NORM Chatbot – Intelligent Query System
- [About](#about)
  - [Overall architecture](#overall-architecture)
- [Getting started](#getting-started)
  - [Environment setup and installation](#environment-setup-and-installation)
  - [Basic usage pipeline](#basic-usage-pipeline)
- [Management and development](#management-and-development)
  - [Add or update documents](#add-or-update-documents)
  - [Configuration](#configuration)
- [Details about the files](#details-about-the-files)
# About

The **NORM Chatbot** is a question-answering system based on
**Retrieval-Augmented Generation (RAG)**, designed for technical documents
(for example, standards and reports related to NORM).

The goal is to let users ask questions in natural language and receive
answers in Portuguese that always explicitly cite the passages and
documents used as their basis, with numbered references in the [n] format.
## Overall architecture

The system is composed of four main blocks:
1. **Data preparation**
   `.md` documents are stored in a folder (by default `Docs/`). A
   normalization step ensures that each file has a clean title in the
   form of a Markdown heading (`# Article title`), which is later used
   in the interface and in the references.

2. **Embeddings and indexing**
   - The texts are split into chunks with overlap, according to the
     configured splitter.
   - Each chunk is converted into an embedding vector using the model
     configured in `configs/config.json` (by default
     `intfloat/multilingual-e5-large`).
   - All vectors are consolidated in `embeddings.npy` and the
     corresponding metadata in `metadata.jsonl`, including
     `document_id`, `document_title`, `fragment_id` and the content of
     the chunk.
   - A vector index is created with **FAISS** and saved in `data/index`.

3. **RAG API (backend)**
   A **FastAPI** service exposes:
   - `POST /query` – receives a question, retrieves the most relevant
     chunks via FAISS and calls the LLM to generate the answer.
   - `GET /list_documents` – lists the documents present in the corpus.

4. **Web interface (frontend)**
   A **Streamlit** application consumes the API and offers two tabs:
   - **Document summaries** – generates summaries from the corpus.
   - **Interactive chatbot** – free Q&A over the documents.
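For illustration, a single record in `metadata.jsonl` could look like the following (the field names come from the description of block 2 above; the values are hypothetical):

```json
{"document_id": "doc_001", "document_title": "Article title", "fragment_id": 3, "content": "Text of the chunk..."}
```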
|
|
|
|
|
---

# Getting started

## Environment setup and installation

1. Create and activate a virtual environment (Python 3.10 or higher):

   ```bash
   python3 -m venv venv
   source venv/bin/activate     # Linux / macOS
   # .\venv\Scripts\activate    # Windows PowerShell
   ```
2. Install the dependencies from the `requirements.txt` at the root of
   the project:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. Make sure the `.md` documents are in the folder configured in
   `configs/config.json` (by default, `Docs/*.md`).
   If the files come from heterogeneous sources (titles on the first
   line, in the middle, in UPPERCASE, etc.), it is recommended to first
   run the title normalization script (see next section).
---

## Basic usage pipeline

Always run commands from the **root of the repository**.

### 1. Normalize `.md` titles (optional but recommended)

For large and heterogeneous corpora, it is important to ensure that
each `.md` file has a clean title in a `# ...` heading. The repository
includes a script that attempts to infer and standardize these titles
based on the file name and content:

```bash
python -m scripts.normalize_md_titles
```

This step adjusts the `.md` files in `Docs/`, inserting or fixing the
first `#` heading of each document, which is later displayed in the
document list in the frontend and used in the references.
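As a hypothetical example (the file contents are invented, and the exact title-inference rules are those implemented in `scripts/normalize_md_titles.py`), a document that starts with an uppercase line and no heading, such as:

```markdown
NORM IN PRODUCED WATER

Produced water often contains...
```

would end up with a clean first heading along the lines of:

```markdown
# NORM in produced water

Produced water often contains...
```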
|
|
|
|
|
### 2. Generate embeddings

```bash
python -m scripts.generate_embeddings
```

This script will:

- Read the `.md` files defined in `paths.input_path`.
- Split the text into chunks, according to the `splitter` defined in
  the config.
- Generate embeddings with the configured model.
- Save the consolidated artifacts in `data/embeddings/`:
  - `embeddings.npy`
  - `metadata.jsonl`
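The splitting step can be sketched roughly as follows. This is a minimal character-based illustration, not the project's actual splitter; `chunk_size` and `overlap` are hypothetical parameters:

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


document = "0123456789" * 120  # a 1200-character stand-in for a .md file
chunks = split_with_overlap(document)
print(len(chunks))                          # 3
print(chunks[0][-100:] == chunks[1][:100])  # True: the 100-char overlap
```

The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk, which improves retrieval quality.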
|
|
|
|
|
### 3. Build the FAISS index

```bash
python -m scripts.build_index
```

This script will:

- Read `data/embeddings/embeddings.npy`.
- Normalize the vectors.
- Build a `faiss.IndexFlatIP` index.
- Save the index to `data/index/faiss.index`.
- Copy `metadata.jsonl` to `data/index/metadata.jsonl`.
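The normalization step matters because `IndexFlatIP` scores by inner product; with L2-normalized vectors, the inner product equals cosine similarity. A minimal NumPy sketch of that step (FAISS itself is omitted here, and the array values are made up):

```python
import numpy as np

# Two made-up embedding vectors (rows), standing in for model output.
embeddings = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)

# L2-normalize each row; afterwards, inner product == cosine similarity,
# which is why an inner-product index like faiss.IndexFlatIP suffices.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms

print(np.linalg.norm(normalized, axis=1))    # every row now has unit length
print(float(normalized[0] @ normalized[1]))  # ~0.6, the cosine of the angle
```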
|
|
|
|
|
### 4. Run the API (backend)

Before starting the API, set the OpenAI key (if you are going to use the
real LLM):

```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"      # Linux / macOS
# $env:OPENAI_API_KEY = "YOUR_KEY_HERE"    # Windows PowerShell
```
Then run:

```bash
uvicorn app.api_server:app --reload --port 8000
```

Main endpoints:

- `GET /list_documents` – returns the list of indexed documents with
  `id` and `title` (derived from the `#` heading of each `.md`).
- `POST /query` – receives a JSON with the question (`question`), the
  number of chunks to retrieve (`top_k`) and, optionally, the model
  temperature (`temperature`), and returns:
  - `answer` – the answer text, already citing the sources in the [n]
    format.
  - `retrieved` – list of chunks used in the answer, containing
    `document_id`, `document_title`, `fragment_id`, `content`, and a
    numeric `citation_id` that corresponds to the [n] used in the
    text.
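For illustration, a `POST /query` request body built from the fields above might look like:

```json
{"question": "O que é NORM?", "top_k": 5, "temperature": 0.2}
```

and an abridged response could be shaped like this (all values invented):

```json
{
  "answer": "NORM refere-se a ... [1].",
  "retrieved": [
    {"document_id": "doc_001", "document_title": "Article title",
     "fragment_id": 3, "content": "...", "citation_id": 1}
  ]
}
```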
|
|
|
|
|
The backend builds a context with the retrieved chunks, assigns a citation
number per document (`citation_id`) and passes a reference table to the
LLM. The model is instructed, via the `system_prompt`, to cite only
these documents using the [n] format and **not to fabricate data**,
especially for years after 2025.
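The per-document numbering described above can be sketched as follows (a minimal illustration assuming chunks arrive in retrieval order; the function name is hypothetical):

```python
def assign_citation_ids(retrieved: list[dict]) -> list[dict]:
    """Give every distinct document one citation number, in order of first appearance."""
    citation_by_doc: dict[str, int] = {}
    for chunk in retrieved:
        doc = chunk["document_id"]
        if doc not in citation_by_doc:
            # First chunk from this document: allocate the next [n] number.
            citation_by_doc[doc] = len(citation_by_doc) + 1
        chunk["citation_id"] = citation_by_doc[doc]
    return retrieved


chunks = [{"document_id": "a"}, {"document_id": "b"}, {"document_id": "a"}]
print([c["citation_id"] for c in assign_citation_ids(chunks)])  # [1, 2, 1]
```

Numbering per document (rather than per chunk) is what lets several chunks from the same file share a single [n] reference in the answer.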
|
|
|
|
|
### 5. Run the Streamlit frontend

In another terminal (also at the project root and with the venv
activated):

```bash
streamlit run app/app_front.py
```

The interface will open in the default browser, usually at
`http://localhost:8501`.
In the **Document Summaries** tab, you can list the indexed documents,
click on a title (the text comes from the `#` heading of the `.md`) and
generate a short summary with coherent [n] references.

In the **Interactive Chatbot** tab, you can ask free-form questions
about the corpus; the answers also contain [n] citations, and the
“Source documents” and “Cited references” sections show only the
documents actually cited in the text.
---

# Management and development

## Add or update documents

1. Add or update the `.md` files in the folder configured in
   `paths.input_path` (by default, `Docs/`).
2. Run the embedding generation script again:

   ```bash
   python -m scripts.generate_embeddings
   ```

3. Rebuild the FAISS index:

   ```bash
   python -m scripts.build_index
   ```

After these steps, the API and frontend will use the updated version of
the document corpus.
## Configuration

The `configs/config.json` file is the **single source of truth** for
system parameters. Some important fields:

- `embeddings.model_name` – embedding model.
- `paths.input_path` – location of the `.md` documents.
- `paths.embeddings_dir` and `paths.index_dir` – where to save embeddings
  and the index.
- `retrieve.top_k` – default number of retrieved chunks.
- `llm.provider`, `llm.model`, `llm.system_prompt` – LLM configuration.
- `ui.api_url` – URL used by Streamlit to call the API.
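Putting the fields above together, the file might be shaped like this. The structure is inferred from the field names; all values are examples, not the project's defaults, except for the embedding model and document path stated earlier:

```json
{
  "paths": {
    "input_path": "Docs/*.md",
    "embeddings_dir": "data/embeddings",
    "index_dir": "data/index"
  },
  "embeddings": {
    "model_name": "intfloat/multilingual-e5-large"
  },
  "retrieve": {
    "top_k": 5
  },
  "llm": {
    "provider": "openai",
    "model": "...",
    "system_prompt": "..."
  },
  "ui": {
    "api_url": "http://localhost:8000"
  }
}
```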
|
|
|
|
|
Whenever you change paths or models, update this file and, if necessary,
regenerate embeddings and the index.

---
# Details about the files

Summary of the main directories and scripts in the repository:

- `configs/`
  - `config.json` – general system parameters (paths, models, LLM, UI).

- `utils/`
  - `base_utils.py` – helper functions (load config, read files, etc.).
  - `retrieval_utils.py` – embedding generation, text splitting and
    retrieval utilities.

- `scripts/`
  - `normalize_md_titles.py` – normalizes `.md` titles, inserting or
    adjusting the first `#` heading with a clean title.
  - `generate_embeddings.py` – generates embeddings and consolidated
    files in `data/embeddings/`.
  - `build_index.py` – creates the FAISS index in `data/index/` from
    the embeddings.

- `data/embeddings/`
  - `embeddings.npy` – embedding matrix (N x D).
  - `metadata.jsonl` – metadata for each chunk (index, document,
    fragment, content).

- `data/index/`
  - `faiss.index` – FAISS vector index.
  - `metadata.jsonl` – copy of the metadata for production use.

- `app/`
  - `api_server.py` – FastAPI service with the `/query` and
    `/list_documents` endpoints.
  - `app_front.py` – Streamlit application (Portuguese) with
    **Summaries** and **Chatbot** tabs.

With these components, the repository provides a complete pipeline to
build and operate a RAG chatbot specialized in NORM documents.

Author: Breinner Espinosa.