---
title: beta-NORM
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
# beta-NORM: NORM Chatbot – Intelligent Query System
- [About](#about)
  - [Overall architecture](#overall-architecture)
- [Getting started](#getting-started)
  - [Environment setup and installation](#environment-setup-and-installation)
  - [Basic usage pipeline](#basic-usage-pipeline)
- [Management and development](#management-and-development)
  - [Add or update documents](#add-or-update-documents)
  - [Configuration](#configuration)
- [Details about the files](#details-about-the-files)
# About
The **NORM Chatbot** is a question-answering system based on
**Retrieval-Augmented Generation (RAG)**, designed for technical
documents (for example, standards and reports related to NORM). It
lets users ask questions in natural language and receive answers in
Portuguese that always cite the supporting passages and documents
explicitly, using numbered references in the [n] format.
## Overall architecture
The system is composed of four main blocks:
1. **Data preparation**
   `.md` documents are stored in a folder (by default `Docs/`). A
   normalization step ensures that each file has a clean title in the
   form of a Markdown heading (`# Article title`), which is later used
   in the interface and in the references.
2. **Embeddings and indexing**
   - The texts are split into chunks with overlap, according to the
     configured splitter.
   - Each chunk is converted into an embedding vector using the model
     configured in `configs/config.json` (by default
     `intfloat/multilingual-e5-large`).
   - All vectors are consolidated in `embeddings.npy` and the
     corresponding metadata in `metadata.jsonl`, including
     `document_id`, `document_title`, `fragment_id` and the content of
     the chunk.
   - A vector index is created with **FAISS** and saved in `data/index`.
3. **RAG API (backend)**
   A **FastAPI** service exposes:
   - `POST /query` – receives a question, retrieves the most relevant
     chunks via FAISS and calls the LLM to generate the answer.
   - `GET /list_documents` – lists the documents present in the corpus.
4. **Web interface (frontend)**
   A **Streamlit** application consumes the API and offers two tabs:
   - **Document summaries** – generates summaries from the corpus.
   - **Interactive chatbot** – free Q&A over the documents.
---
# Getting started
## Environment setup and installation
1. Create and activate a Python virtual environment (3.10 or higher):

   ```bash
   python3 -m venv venv
   source venv/bin/activate        # Linux / macOS
   # .\venv\Scripts\Activate.ps1   # Windows PowerShell
   ```

2. Install the dependencies from the `requirements.txt` at the root of
   the project:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. Make sure the `.md` documents are in the folder configured in
   `configs/config.json` (by default, `Docs/*.md`).
   If the files come from heterogeneous sources (titles on the first
   line, in the middle, in UPPERCASE, etc.), it is recommended to first
   run the title normalization script (see the next section).
---
## Basic usage pipeline
Always run commands from the **root of the repository**.
### 1. Normalize `.md` titles (optional but recommended)
For large and heterogeneous corpora, it is important to ensure that
each `.md` file has a clean title in a `# ...` heading. The repository
includes a script that attempts to infer and standardize these titles
based on the file name and content:
```bash
python -m scripts.normalize_md_titles
```
This step adjusts the `.md` files in `Docs/`, inserting or fixing the
first `#` heading of each document, which will later be displayed in the
document list in the frontend and used in the references.
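The actual heuristics live in `scripts/normalize_md_titles.py`; the sketch below only illustrates the general idea (the function name and the file-name fallback rule are illustrative, not the script's real implementation):

```python
from pathlib import Path


def normalize_title(md_path: Path) -> str:
    """Ensure the file starts with a '# Title' heading and return the title.

    Illustrative sketch: the real script applies richer heuristics.
    """
    text = md_path.read_text(encoding="utf-8")
    lines = text.splitlines()
    # Reuse an existing first-level heading if one is already present.
    for line in lines:
        if line.startswith("# "):
            return line[2:].strip()
    # Otherwise, fall back to a title derived from the file name and
    # insert it as the first line of the document.
    title = md_path.stem.replace("_", " ").replace("-", " ").title()
    lines.insert(0, f"# {title}")
    md_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return title
```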
### 2. Generate embeddings
```bash
python -m scripts.generate_embeddings
```
This script will:

- Read the `.md` files defined in `paths.input_path`.
- Split the text into chunks, according to the `splitter` defined in
  the config.
- Generate embeddings with the configured model.
- Save the consolidated artifacts in `data/embeddings/`:
  - `embeddings.npy`
  - `metadata.jsonl`
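The actual splitter and its parameters come from `configs/config.json`; the sketch below only illustrates overlapping chunking and the `metadata.jsonl` record shape described above (chunk sizes and function names are illustrative, not the project's real code):

```python
import json


def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Character-based splitting with overlap between consecutive chunks
    (illustrative; the real splitter is configured in configs/config.json)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_records(document_id: str, document_title: str, text: str) -> list[str]:
    """One JSON line per chunk, with the metadata fields listed in this README."""
    return [
        json.dumps(
            {
                "document_id": document_id,
                "document_title": document_title,
                "fragment_id": i,
                "content": chunk,
            },
            ensure_ascii=False,
        )
        for i, chunk in enumerate(split_with_overlap(text))
    ]
```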
### 3. Build the FAISS index
```bash
python -m scripts.build_index
```
This script will:
- Read `data/embeddings/embeddings.npy`.
- Normalize the vectors.
- Build a `faiss.IndexFlatIP` index.
- Save the index to `data/index/faiss.index`.
- Copy `metadata.jsonl` to `data/index/metadata.jsonl`.
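Because the vectors are L2-normalized before being added to `faiss.IndexFlatIP`, the inner-product search is equivalent to cosine similarity. A NumPy-only illustration of that math (the real script uses FAISS itself; function names here are illustrative):

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalization, as done before adding vectors to the index."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_inner_product(index_vecs: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """What an exact inner-product search computes: indices of the k
    largest dot products between the query and the indexed vectors."""
    scores = l2_normalize(index_vecs) @ l2_normalize(query[None, :])[0]
    return np.argsort(-scores)[:k]
```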
### 4. Run the API (backend)
Before starting the API, set the OpenAI key (if you are going to use the
real LLM):
```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"       # Linux / macOS
# $env:OPENAI_API_KEY = "YOUR_KEY_HERE"     # Windows PowerShell
```
Then run:
```bash
uvicorn app.api_server:app --reload --port 8000
```
Main endpoints:

- `GET /list_documents` – returns the list of indexed documents with
  `id` and `title` (derived from the `#` heading of each `.md`).
- `POST /query` – receives a JSON with the question (`question`), the
  number of chunks to retrieve (`top_k`) and, optionally, the model
  temperature (`temperature`), and returns:
  - `answer` – the answer text, already citing the sources in the [n]
    format.
  - `retrieved` – the list of chunks used in the answer, containing
    `document_id`, `document_title`, `fragment_id`, `content`, and a
    numeric `citation_id` that corresponds to the [n] used in the
    text.
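As a client-side illustration, the sketch below builds a request body for `POST /query` and maps `citation_id` values in a response back to document titles. Only the field names come from the endpoint description above; the question text and sample response contents are made up:

```python
import json

# Request body for POST /query (field names from the endpoint description;
# the question text is just an example).
payload = {
    "question": "What are the exemption limits for NORM?",
    "top_k": 5,
    "temperature": 0.2,  # optional
}


def cited_titles(response: dict) -> dict[int, str]:
    """Map each [n] citation number to the title of the cited document."""
    return {
        chunk["citation_id"]: chunk["document_title"]
        for chunk in response.get("retrieved", [])
    }


# A response with the documented shape (contents abridged and invented).
sample = {
    "answer": "The limits are set out in standard X [1].",
    "retrieved": [
        {
            "document_id": "doc-001",
            "document_title": "Standard X",
            "fragment_id": 3,
            "content": "…",
            "citation_id": 1,
        }
    ],
}
```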
The backend builds a context with retrieved chunks, assigns a citation
number per document (`citation_id`) and passes a reference table to the
LLM. The model is instructed, via the `system_prompt`, to cite only
these documents using the [n] format and **not to fabricate data**,
especially for years after 2025.
### 5. Run the Streamlit frontend
In another terminal (also at the project root and with the venv
activated):
```bash
streamlit run app/app_front.py
```
The interface will open in the default browser, usually at
`http://localhost:8501`.
In the **Document Summaries** tab, you can list the indexed documents,
click on a title (the text comes from the `#` heading of the `.md`) and
generate a short summary with coherent [n] references.
In the **Interactive Chatbot** tab, you can ask free-form questions
about the corpus; the answers also contain [n] citations, and the
“Source documents” and “Cited references” sections show only the
documents actually cited in the text.
---
# Management and development
## Add or update documents
1. Add or update the `.md` files in the folder configured in
   `paths.input_path` (by default, `Docs/`).
2. Run the embedding generation script again:

   ```bash
   python -m scripts.generate_embeddings
   ```

3. Rebuild the FAISS index:

   ```bash
   python -m scripts.build_index
   ```
After these steps, the API and frontend will use the updated version of
the document corpus.
## Configuration
The `configs/config.json` file is the **single source of truth** for
system parameters. Some important fields:
- `embeddings.model_name` – embedding model.
- `paths.input_path` – location of the `.md` documents.
- `paths.embeddings_dir` and `paths.index_dir` – where to save embeddings
and the index.
- `retrieve.top_k` – default number of retrieved chunks.
- `llm.provider`, `llm.model`, `llm.system_prompt` – LLM configuration.
- `ui.api_url` – URL used by Streamlit to call the API.
Whenever you change paths or models, update this file and, if necessary,
regenerate embeddings and the index.
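For orientation, a sketch of how the file might be laid out given the fields listed above. All values are illustrative placeholders; refer to the actual `configs/config.json` in the repository for the real schema and defaults:

```json
{
  "embeddings": { "model_name": "intfloat/multilingual-e5-large" },
  "paths": {
    "input_path": "Docs/*.md",
    "embeddings_dir": "data/embeddings",
    "index_dir": "data/index"
  },
  "retrieve": { "top_k": 5 },
  "llm": {
    "provider": "openai",
    "model": "<model-name>",
    "system_prompt": "..."
  },
  "ui": { "api_url": "http://localhost:8000" }
}
```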
---
# Details about the files
Summary of the main directories and scripts in the repository:
- `configs/`
  - `config.json` – general system parameters (paths, models, LLM, UI).
- `utils/`
  - `base_utils.py` – helper functions (load config, read files, etc.).
  - `retrieval_utils.py` – embedding generation, text splitting and
    retrieval utilities.
- `scripts/`
  - `normalize_md_titles.py` – normalizes `.md` titles, inserting or
    adjusting the first `#` heading with a clean title.
  - `generate_embeddings.py` – generates embeddings and consolidated
    files in `data/embeddings/`.
  - `build_index.py` – creates the FAISS index in `data/index/` from
    the embeddings.
- `data/embeddings/`
  - `embeddings.npy` – embedding matrix (N x D).
  - `metadata.jsonl` – metadata for each chunk (index, document,
    fragment, content).
- `data/index/`
  - `faiss.index` – FAISS vector index.
  - `metadata.jsonl` – copy of the metadata for production use.
- `app/`
  - `api_server.py` – FastAPI service with the `/query` and
    `/list_documents` endpoints.
  - `app_front.py` – Streamlit application (in Portuguese) with
    **Summaries** and **Chatbot** tabs.
With these components, the repository provides a complete pipeline to
build and operate a RAG chatbot specialized in NORM documents.
Author: Breinner Espinosa.