---
title: beta-NORM
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# NORM Chatbot – Intelligent Query System
## About
The NORM Chatbot is a question–answering system based on Retrieval-Augmented Generation (RAG) designed for technical documents (for example, standards and reports related to NORM).
The goal is to allow the user to ask questions in natural language and receive answers in Portuguese, always explicitly citing the passages and documents that were used as the basis, with numbered references in the format [n].
## Overall architecture
The system is composed of four main blocks:
1. **Data preparation**
   `.md` documents are stored in a folder (by default `Docs/`). A normalization step ensures that each file has a clean title in the form of a Markdown heading (`# Article title`), which is later used in the interface and in the references.
2. **Embeddings and indexing**
   - The texts are split into chunks with overlap, according to the configured splitter.
   - Each chunk is converted into an embedding vector using the model configured in `configs/config.json` (by default `intfloat/multilingual-e5-large`).
   - All vectors are consolidated in `embeddings.npy` and the corresponding metadata in `metadata.jsonl`, including `document_id`, `document_title`, `fragment_id` and the content of the chunk.
   - A vector index is created with FAISS and saved in `data/index`.
3. **RAG API (backend)**
   A FastAPI service exposes:
   - `POST /query` – receives a question, retrieves the most relevant chunks via FAISS and calls the LLM to generate the answer.
   - `GET /list_documents` – lists the documents present in the corpus.
4. **Web interface (frontend)**
   A Streamlit application consumes the API and offers two tabs:
   - **Document summaries** – generates summaries from the corpus.
   - **Interactive chatbot** – free Q&A over the documents.
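The chunk-with-overlap splitting described in step 2 can be sketched as follows. This is only an illustration of the idea: the real splitter and its parameters come from `configs/config.json`, and the `chunk_size`/`overlap` values here are placeholders.

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks where each chunk
    shares its first `overlap` characters with the end of the previous one.
    Sketch only; the project's actual splitter is configured in config.json."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighbouring chunks.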
## Getting started

### Environment setup and installation
- Create and activate a Python virtual environment (3.10 or higher):

  ```bash
  python3 -m venv venv
  source venv/bin/activate   # Linux / macOS
  # .\venv\Scripts\activate  # Windows PowerShell
  ```
- Install the dependencies from the `requirements.txt` at the root of the project:

  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Make sure the `.md` documents are in the folder configured in `configs/config.json` (by default, `Docs/*.md`). If the files come from heterogeneous sources (titles on the first line, in the middle, in UPPERCASE, etc.), it is recommended to first run the title normalization script (see next section).
### Basic usage pipeline
Always run commands from the root of the repository.
#### 1. Normalize `.md` titles (optional but recommended)
For large and heterogeneous corpora, it is important to ensure that
each .md file has a clean title in a # ... heading. The repository
includes a script that attempts to infer and standardize these titles
based on the file name and content:
```bash
python -m scripts.normalize_md_titles
```
This step adjusts the .md files in Docs/, inserting or fixing the
first # heading of each document, which will later be displayed in the
document list in the frontend and used in the references.
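The script's actual heuristics live in `scripts/normalize_md_titles.py`; the sketch below only illustrates the general idea of deriving a clean `# ...` heading, either by reusing an existing heading or by falling back to the file name. The function name and its rules are assumptions for illustration, not the script's real implementation.

```python
import re
from pathlib import Path


def normalized_title(path: Path, first_line: str) -> str:
    """Return a clean '# Title' heading for a .md file.
    Reuse an existing heading if one is present; otherwise derive a title
    from the file name. Hypothetical sketch, not the repository's heuristics."""
    stripped = first_line.strip()
    if stripped.startswith("#"):
        text = stripped.lstrip("#").strip()
    else:
        # Fall back to the file name, turning separators into spaces.
        text = re.sub(r"[-_]+", " ", path.stem).strip()
    # Title-case fully UPPERCASE titles so the frontend shows readable names.
    if text.isupper():
        text = text.title()
    return f"# {text}"
```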
#### 2. Generate embeddings

```bash
python -m scripts.generate_embeddings
```
This script will:
- Read the `.md` files defined in `paths.input_path`.
- Split the text into chunks, according to the `splitter` defined in the config.
- Generate embeddings with the configured model.
- Save the consolidated artifacts in `data/embeddings/`:
  - `embeddings.npy`
  - `metadata.jsonl`
#### 3. Build the FAISS index

```bash
python -m scripts.build_index
```
This script will:
- Read `data/embeddings/embeddings.npy`.
- Normalize the vectors.
- Build a `faiss.IndexFlatIP` index.
- Save the index to `data/index/faiss.index`.
- Copy `metadata.jsonl` to `data/index/metadata.jsonl`.
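Normalizing before indexing is what makes `IndexFlatIP` (inner product) behave as cosine-similarity search. A NumPy stand-in for that normalize-then-search step, shown purely to explain the math, not as the script's code:

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize rows, as build_index does before indexing.
    After this, inner product between rows equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_inner_product(index_vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """NumPy equivalent of faiss.IndexFlatIP.search for a single query vector:
    returns the indices of the k highest-scoring rows."""
    scores = normalize(index_vectors) @ normalize(query[None, :])[0]
    return np.argsort(-scores)[:k]
```

FAISS does the same computation with an optimized index structure; the returned row indices are what the API maps back to chunks via `metadata.jsonl`.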
#### 4. Run the API (backend)
Before starting the API, set the OpenAI key (if you are going to use the real LLM):
```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"    # Linux / macOS
# setx OPENAI_API_KEY "YOUR_KEY_HERE"    # Windows (PowerShell)
```
Then run:
```bash
uvicorn app.api_server:app --reload --port 8000
```
Main endpoints:
- `GET /list_documents` – returns the list of indexed documents with `id` and `title` (derived from the `#` heading of each `.md`).
- `POST /query` – receives a JSON with the question (`question`), the number of chunks to retrieve (`top_k`) and, optionally, the model temperature (`temperature`), and returns:
  - `answer` – the answer text, already citing the sources in the [n] format.
  - `retrieved` – list of chunks used in the answer, containing `document_id`, `document_title`, `fragment_id`, `content`, and a numeric `citation_id` that corresponds to the [n] used in the text.
The backend builds a context with retrieved chunks, assigns a citation
number per document (citation_id) and passes a reference table to the
LLM. The model is instructed, via the system_prompt, to cite only
these documents using the [n] format and not to fabricate data,
especially for years after 2025.
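A hedged sketch of a `/query` round trip follows. The field names are those documented above; the concrete values (question text, document title, IDs) are purely illustrative and do not come from the repository.

```python
import json

# Hypothetical example request body for POST /query (field names from this README).
request_body = {
    "question": "Quais limites o documento estabelece para NORM?",
    "top_k": 5,          # optional; overrides retrieve.top_k
    "temperature": 0.2,  # optional
}

# Illustrative shape of the JSON the endpoint returns.
example_response = {
    "answer": "Segundo o relatório [1], ...",
    "retrieved": [
        {
            "document_id": "doc-001",
            "document_title": "Example NORM report",
            "fragment_id": 12,
            "content": "...",
            "citation_id": 1,
        }
    ],
}


def cited_ids(response: dict) -> set[int]:
    """Collect the citation numbers that the [n] markers in `answer` refer to."""
    return {chunk["citation_id"] for chunk in response["retrieved"]}


print(json.dumps(request_body, ensure_ascii=False))
```

A frontend can use `cited_ids` to show only the documents actually cited in the answer text, which is what the Streamlit interface does in its reference sections.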
#### 5. Run the Streamlit frontend
In another terminal (also at the project root and with the venv activated):
```bash
streamlit run app/app_front.py
```
The interface will open in the default browser, usually at `http://localhost:8501`.
In the Document Summaries tab, you can list the indexed documents,
click on a title (the text comes from the # heading of the .md) and
generate a short summary with coherent [n] references.
In the Interactive Chatbot tab, you can ask free-form questions about the corpus; the answers also contain [n] citations, and the “Source documents” and “Cited references” sections show only the documents actually cited in the text.
## Management and development

### Add or update documents
1. Add or update the `.md` files in the folder configured in `paths.input_path` (by default, `Docs/`).
2. Run the embedding generation script again:

   ```bash
   python -m scripts.generate_embeddings
   ```

3. Rebuild the FAISS index:

   ```bash
   python -m scripts.build_index
   ```
After these steps, the API and frontend will use the updated version of the document corpus.
## Configuration
The `configs/config.json` file is the single source of truth for system parameters. Some important fields:
- `embeddings.model_name` – embedding model.
- `paths.input_path` – location of the `.md` documents.
- `paths.embeddings_dir` and `paths.index_dir` – where to save embeddings and the index.
- `retrieve.top_k` – default number of retrieved chunks.
- `llm.provider`, `llm.model`, `llm.system_prompt` – LLM configuration.
- `ui.api_url` – URL used by Streamlit to call the API.
Whenever you change paths or models, update this file and, if necessary, regenerate embeddings and the index.
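For orientation, a config fragment with the fields above might look like the following. Only `embeddings.model_name` and `paths.input_path` reflect defaults stated in this README; every other value (paths, `top_k`, provider, model name, URL) is a placeholder, not the repository's actual configuration.

```json
{
  "embeddings": { "model_name": "intfloat/multilingual-e5-large" },
  "paths": {
    "input_path": "Docs/*.md",
    "embeddings_dir": "data/embeddings",
    "index_dir": "data/index"
  },
  "retrieve": { "top_k": 5 },
  "llm": {
    "provider": "openai",
    "model": "<your-model>",
    "system_prompt": "..."
  },
  "ui": { "api_url": "http://localhost:8000" }
}
```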
## Details about the files
Summary of the main directories and scripts in the repository:
- `configs/config.json` – general system parameters (paths, models, LLM, UI).
- `utils/`
  - `base_utils.py` – helper functions (load config, read files, etc.).
  - `retrieval_utils.py` – embedding generation, text splitting and retrieval utilities.
- `scripts/`
  - `normalize_md_titles.py` – normalizes `.md` titles, inserting or adjusting the first `#` heading with a clean title.
  - `generate_embeddings.py` – generates embeddings and consolidated files in `data/embeddings/`.
  - `build_index.py` – creates the FAISS index in `data/index/` from the embeddings.
- `data/embeddings/`
  - `embeddings.npy` – embedding matrix (N x D).
  - `metadata.jsonl` – metadata for each chunk (index, document, fragment, content).
- `data/index/`
  - `faiss.index` – FAISS vector index.
  - `metadata.jsonl` – copy of the metadata for production use.
- `app/`
  - `api_server.py` – FastAPI service with the `/query` and `/list_documents` endpoints.
  - `app_front.py` – Streamlit application (Portuguese) with Summaries and Chatbot tabs.
With these components, the repository provides a complete pipeline to build and operate a RAG chatbot specialized in NORM documents.
**Author:** Breinner Espinosa.