---
title: beta-NORM
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# beta-NORM

# NORM Chatbot – Intelligent Query System

- [About](#about)
- [Overall architecture](#overall-architecture)
- [Getting started](#getting-started)
  - [Environment setup and installation](#environment-setup-and-installation)
  - [Basic usage pipeline](#basic-usage-pipeline)
- [Management and development](#management-and-development)
  - [Add or update documents](#add-or-update-documents)
  - [Configuration](#configuration)
- [Details about the files](#details-about-the-files)

# About

The **NORM Chatbot** is a question-answering system based on **Retrieval-Augmented Generation (RAG)** designed for technical documents (for example, standards and reports related to NORM). The goal is to let the user ask questions in natural language and receive answers in Portuguese, always explicitly citing the passages and documents used as the basis for the answer, with numbered references in the format [n].

## Overall architecture

The system is composed of four main blocks:

1. **Data preparation**

   `.md` documents are stored in a folder (by default `Docs/`). A normalization step ensures that each file has a clean title in the form of a Markdown heading (`# Article title`), which is later used in the interface and in the references.

2. **Embeddings and indexing**

   - The texts are split into chunks with overlap, according to the configured splitter.
   - Each chunk is converted into an embedding vector using the model configured in `configs/config.json` (by default `intfloat/multilingual-e5-large`).
   - All vectors are consolidated in `embeddings.npy` and the corresponding metadata in `metadata.jsonl`, including `document_id`, `document_title`, `fragment_id` and the content of the chunk.
   - A vector index is created with **FAISS** and saved in `data/index`.

3. **RAG API (backend)**

   A **FastAPI** service exposes:

   - `POST /query` – receives a question, retrieves the most relevant chunks via FAISS and calls the LLM to generate the answer.
   - `GET /list_documents` – lists the documents present in the corpus.

4. **Web interface (frontend)**

   A **Streamlit** application consumes the API and offers two tabs:

   - **Document summaries** – generates summaries from the corpus.
   - **Interactive chatbot** – free Q&A over the documents.

---

# Getting started

## Environment setup and installation

1. Create and activate a Python virtual environment (3.10 or higher):

   ```bash
   python3 -m venv venv
   source venv/bin/activate   # Linux / macOS
   # .\venv\Scripts\activate  # Windows PowerShell
   ```

2. Install the dependencies from the `requirements.txt` at the root of the project:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. Make sure the `.md` documents are in the folder configured in `configs/config.json` (by default, `Docs/*.md`). If the files come from heterogeneous sources (titles on the first line, in the middle, in UPPERCASE, etc.), it is recommended to first run the title normalization script (see the next section).

---

## Basic usage pipeline

Always run commands from the **root of the repository**.

### 1. Normalize `.md` titles (optional but recommended)

For large and heterogeneous corpora, it is important to ensure that each `.md` file has a clean title in a `# ...` heading. The repository includes a script that attempts to infer and standardize these titles based on the file name and content:

```bash
python -m scripts.normalize_md_titles
```

This step adjusts the `.md` files in `Docs/`, inserting or fixing the first `#` heading of each document, which will later be displayed in the document list in the frontend and used in the references.

### 2. Generate embeddings

```bash
python -m scripts.generate_embeddings
```

This script will:

- Read the `.md` files defined in `paths.input_path`.
- Split the text into chunks, according to the `splitter` defined in the config.
- Generate embeddings with the configured model.
- Save the consolidated artifacts in `data/embeddings/`:
  - `embeddings.npy`
  - `metadata.jsonl`

### 3. Build the FAISS index

```bash
python -m scripts.build_index
```

This script will:

- Read `data/embeddings/embeddings.npy`.
- Normalize the vectors.
- Build a `faiss.IndexFlatIP` index.
- Save the index to `data/index/faiss.index`.
- Copy `metadata.jsonl` to `data/index/metadata.jsonl`.

### 4. Run the API (backend)

Before starting the API, set the OpenAI key (if you are going to use the real LLM):

```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"    # Linux / macOS
# $env:OPENAI_API_KEY = "YOUR_KEY_HERE"  # Windows PowerShell
```

Then run:

```bash
uvicorn app.api_server:app --reload --port 8000
```

Main endpoints:

- `GET /list_documents` – returns the list of indexed documents with `id` and `title` (derived from the `#` heading of each `.md`).
- `POST /query` – receives a JSON with the question (`question`), the number of chunks to retrieve (`top_k`) and, optionally, the model temperature (`temperature`), and returns:
  - `answer` – the answer text, already citing the sources in the [n] format.
  - `retrieved` – list of chunks used in the answer, containing `document_id`, `document_title`, `fragment_id`, `content`, and a numeric `citation_id` that corresponds to the [n] used in the text.

The backend builds a context from the retrieved chunks, assigns a citation number per document (`citation_id`) and passes a reference table to the LLM. The model is instructed, via the `system_prompt`, to cite only these documents using the [n] format and **not to fabricate data**, especially for years after 2025.

### 5. Run the Streamlit frontend

In another terminal (also at the project root and with the venv activated):

```bash
streamlit run app/app_front.py
```

The interface will open in the default browser, usually at `http://localhost:8501`.
In the **Document Summaries** tab, you can list the indexed documents, click on a title (the text comes from the `#` heading of the `.md`) and generate a short summary with coherent [n] references.

In the **Interactive Chatbot** tab, you can ask free-form questions about the corpus; the answers also contain [n] citations, and the “Source documents” and “Cited references” sections show only the documents actually cited in the text.

---

# Management and development

## Add or update documents

1. Add or update the `.md` files in the folder configured in `paths.input_path` (by default, `Docs/`).

2. Run the embedding generation script again:

   ```bash
   python -m scripts.generate_embeddings
   ```

3. Rebuild the FAISS index:

   ```bash
   python -m scripts.build_index
   ```

After these steps, the API and frontend will use the updated version of the document corpus.

## Configuration

The `configs/config.json` file is the **single source of truth** for system parameters. Some important fields:

- `embeddings.model_name` – embedding model.
- `paths.input_path` – location of the `.md` documents.
- `paths.embeddings_dir` and `paths.index_dir` – where to save embeddings and the index.
- `retrieve.top_k` – default number of retrieved chunks.
- `llm.provider`, `llm.model`, `llm.system_prompt` – LLM configuration.
- `ui.api_url` – URL used by Streamlit to call the API.

Whenever you change paths or models, update this file and, if necessary, regenerate the embeddings and the index.

---

# Details about the files

Summary of the main directories and scripts in the repository:

- `configs/`
  - `config.json` – general system parameters (paths, models, LLM, UI).
- `utils/`
  - `base_utils.py` – helper functions (load config, read files, etc.).
  - `retrieval_utils.py` – embedding generation, text splitting and retrieval utilities.
- `scripts/`
  - `normalize_md_titles.py` – normalizes `.md` titles, inserting or adjusting the first `#` heading with a clean title.
  - `generate_embeddings.py` – generates embeddings and consolidated files in `data/embeddings/`.
  - `build_index.py` – creates the FAISS index in `data/index/` from the embeddings.
- `data/embeddings/`
  - `embeddings.npy` – embedding matrix (N x D).
  - `metadata.jsonl` – metadata for each chunk (index, document, fragment, content).
- `data/index/`
  - `faiss.index` – FAISS vector index.
  - `metadata.jsonl` – copy of the metadata for production use.
- `app/`
  - `api_server.py` – FastAPI service with the `/query` and `/list_documents` endpoints.
  - `app_front.py` – Streamlit application (Portuguese) with **Summaries** and **Chatbot** tabs.

With these components, the repository provides a complete pipeline to build and operate a RAG chatbot specialized in NORM documents.

Author: Breinner Espinosa.
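As an appendix: the `faiss.IndexFlatIP` over L2-normalized vectors used by the pipeline amounts to cosine-similarity search. This is a hedged, FAISS-free NumPy sketch of that retrieval logic on hypothetical toy data, not the project's actual retrieval code:

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def top_k_search(index_vectors: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar rows (what IndexFlatIP.search does
    when both index and query vectors were normalized beforehand)."""
    scores = normalize(index_vectors) @ normalize(query)
    return np.argsort(-scores)[:k]


# Hypothetical toy corpus: 4 chunk embeddings of dimension 5.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 5)).astype("float32")

hits = top_k_search(emb, emb[2], k=2)
print(hits)  # the query vector itself ranks first
```

In the real pipeline the rows of `embeddings.npy` play the role of `emb`, and each returned index maps to a line in `metadata.jsonl` carrying `document_id`, `document_title`, `fragment_id` and the chunk content.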