---
title: beta-NORM
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# beta-NORM

**NORM Chatbot – Intelligent Query System**

## About

The NORM Chatbot is a question-answering system based on Retrieval-Augmented Generation (RAG), designed for technical documents (for example, standards and reports related to NORM).

The goal is to let the user ask questions in natural language and receive answers in Portuguese that explicitly cite the passages and documents used as sources, with numbered references in the format `[n]`.

## Overall architecture

The system is composed of four main blocks:

1. **Data preparation**
   `.md` documents are stored in a folder (by default `Docs/`). A normalization step ensures that each file has a clean title in the form of a Markdown heading (`# Article title`), which is later used in the interface and in the references.

2. **Embeddings and indexing**
   - The texts are split into chunks with overlap, according to the configured splitter.
   - Each chunk is converted into an embedding vector using the model configured in `configs/config.json` (by default `intfloat/multilingual-e5-large`).
   - All vectors are consolidated in `embeddings.npy` and the corresponding metadata in `metadata.jsonl`, including `document_id`, `document_title`, `fragment_id`, and the content of the chunk (a sample record is sketched just after this list).
   - A vector index is created with FAISS and saved in `data/index`.

3. **RAG API (backend)**
   A FastAPI service exposes:
   - `POST /query` – receives a question, retrieves the most relevant chunks via FAISS, and calls the LLM to generate the answer.
   - `GET /list_documents` – lists the documents present in the corpus.

4. **Web interface (frontend)**
   A Streamlit application consumes the API and offers two tabs:
   - **Document summaries** – generates summaries from the corpus.
   - **Interactive chatbot** – free Q&A over the documents.
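For reference, each line of `metadata.jsonl` describes one chunk. A minimal sketch of one record, using the field names above but invented values:

```python
import json

# Hypothetical metadata.jsonl record: the field names follow this README,
# the values are invented for illustration only.
record = {
    "document_id": "norm_overview",        # id derived from the source .md file
    "document_title": "NORM overview",     # text of the document's '#' heading
    "fragment_id": 3,                      # position of the chunk in the document
    "content": "Naturally Occurring Radioactive Material (NORM) ...",
}
print(json.dumps(record, ensure_ascii=False))
```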

## Getting started

### Environment setup and installation

1. Create and activate a Python virtual environment (3.10 or higher):

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # Linux / macOS
   # .\venv\Scripts\activate  # Windows PowerShell
   ```

2. Install the dependencies from the `requirements.txt` at the root of the project:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. Make sure the `.md` documents are in the folder configured in `configs/config.json` (by default, `Docs/*.md`). If the files come from heterogeneous sources (titles on the first line, in the middle, in UPPERCASE, etc.), run the title normalization script first (see the next section).

### Basic usage pipeline

Always run commands from the root of the repository.

#### 1. Normalize .md titles (optional but recommended)

For large and heterogeneous corpora, it is important to ensure that each .md file has a clean title in a # ... heading. The repository includes a script that attempts to infer and standardize these titles based on the file name and content:

```bash
python -m scripts.normalize_md_titles
```

This step adjusts the `.md` files in `Docs/`, inserting or fixing the first `#` heading of each document; that heading is later displayed in the document list in the frontend and used in the references.
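The actual heuristics live in `scripts/normalize_md_titles.py`; as a rough, simplified sketch of the idea (not the script's real logic):

```python
import re
from pathlib import Path

def ensure_title(path: Path) -> None:
    """Insert a '# Title' heading if the file does not already start with one.

    Simplified: the real script also handles titles found mid-file,
    UPPERCASE lines, and other heterogeneous layouts.
    """
    text = path.read_text(encoding="utf-8")
    if re.match(r"^#\s+\S", text):
        return  # already has a clean top-level heading
    # Fall back to a title derived from the file name.
    title = path.stem.replace("_", " ").replace("-", " ").strip().capitalize()
    path.write_text(f"# {title}\n\n{text}", encoding="utf-8")

for md_file in sorted(Path("Docs").glob("*.md")):
    ensure_title(md_file)
```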

#### 2. Generate embeddings

```bash
python -m scripts.generate_embeddings
```

This script will:

- Read the `.md` files defined in `paths.input_path`.
- Split the text into chunks, according to the splitter defined in the config.
- Generate embeddings with the configured model.
- Save the consolidated artifacts in `data/embeddings/`:
  - `embeddings.npy`
  - `metadata.jsonl`
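Conceptually, the step boils down to something like the sketch below. This assumes `sentence-transformers` as the embedding backend and a naive character-based splitter; the real script follows the model and splitter set in `configs/config.json`, so treat the details as illustrative.

```python
import json
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

# Default model from configs/config.json; swap in whatever the config specifies.
model = SentenceTransformer("intfloat/multilingual-e5-large")

def split(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size splitter with overlap; the real splitter is configurable."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

chunks, metadata = [], []
for doc in sorted(Path("Docs").glob("*.md")):
    text = doc.read_text(encoding="utf-8")
    title = (text.splitlines() or [""])[0].lstrip("# ").strip()
    for frag_id, chunk in enumerate(split(text)):
        chunks.append(chunk)
        metadata.append({
            "document_id": doc.stem,
            "document_title": title,
            "fragment_id": frag_id,
            "content": chunk,
        })

out_dir = Path("data/embeddings")
out_dir.mkdir(parents=True, exist_ok=True)
vectors = model.encode(chunks)  # one embedding vector per chunk
np.save(out_dir / "embeddings.npy", np.asarray(vectors, dtype="float32"))
with open(out_dir / "metadata.jsonl", "w", encoding="utf-8") as f:
    for record in metadata:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```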

#### 3. Build the FAISS index

```bash
python -m scripts.build_index
```

This script will:

- Read `data/embeddings/embeddings.npy`.
- Normalize the vectors.
- Build a `faiss.IndexFlatIP` index.
- Save the index to `data/index/faiss.index`.
- Copy `metadata.jsonl` to `data/index/metadata.jsonl`.
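Since the index type and paths are fixed above, the core of this step can be sketched in a few lines (a simplified illustration, not the exact contents of `scripts/build_index.py`):

```python
import shutil
from pathlib import Path

import faiss
import numpy as np

emb_dir, idx_dir = Path("data/embeddings"), Path("data/index")
idx_dir.mkdir(parents=True, exist_ok=True)

vectors = np.load(emb_dir / "embeddings.npy").astype("float32")
faiss.normalize_L2(vectors)  # L2-normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product index
index.add(vectors)
faiss.write_index(index, str(idx_dir / "faiss.index"))

shutil.copy(emb_dir / "metadata.jsonl", idx_dir / "metadata.jsonl")
```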

#### 4. Run the API (backend)

Before starting the API, set the OpenAI API key (needed if you are going to use the real LLM):

```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"    # Linux / macOS
# $env:OPENAI_API_KEY = "YOUR_KEY_HERE"  # Windows PowerShell (current session)
# setx OPENAI_API_KEY "YOUR_KEY_HERE"    # Windows (persists, but only for new sessions)
```

Then run:

```bash
uvicorn app.api_server:app --reload --port 8000
```

Main endpoints:

- `GET /list_documents` – returns the list of indexed documents with `id` and `title` (derived from the `#` heading of each `.md`).
- `POST /query` – receives a JSON with the question (`question`), the number of chunks to retrieve (`top_k`), and, optionally, the model temperature (`temperature`), and returns:
  - `answer` – the answer text, already citing the sources in the `[n]` format.
  - `retrieved` – the list of chunks used in the answer, containing `document_id`, `document_title`, `fragment_id`, `content`, and a numeric `citation_id` that corresponds to the `[n]` used in the text.

The backend builds a context from the retrieved chunks, assigns one citation number per document (`citation_id`), and passes a reference table to the LLM. The model is instructed, via the `system_prompt`, to cite only these documents using the `[n]` format and not to fabricate data, especially for years after 2025.
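For illustration, a hypothetical client session against a local instance (endpoint and field names follow the description above; the exact response schema may differ):

```python
import requests

API = "http://localhost:8000"  # port from the uvicorn command above

# List the indexed documents (id + title).
documents = requests.get(f"{API}/list_documents", timeout=30).json()
print(documents)

# Ask a question; top_k and temperature are optional per the description above.
response = requests.post(
    f"{API}/query",
    json={"question": "O que é NORM?", "top_k": 5, "temperature": 0.2},
    timeout=120,
).json()

print(response["answer"])  # answer text with [n] citations
for chunk in response["retrieved"]:
    print(chunk["citation_id"], chunk["document_title"], chunk["fragment_id"])
```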

#### 5. Run the Streamlit frontend

In another terminal (also at the project root and with the venv activated):

```bash
streamlit run app/app_front.py
```

The interface will open in the default browser, usually at `http://localhost:8501`.

In the **Document summaries** tab, you can list the indexed documents, click a title (the text comes from the `#` heading of the `.md`), and generate a short summary with coherent `[n]` references.

In the **Interactive chatbot** tab, you can ask free-form questions about the corpus; the answers also contain `[n]` citations, and the "Source documents" and "Cited references" sections show only the documents actually cited in the text.
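In spirit, the chatbot tab reduces to a call like the following sketch (not the real `app_front.py`; tab labels and widgets are invented for illustration):

```python
import requests
import streamlit as st

API_URL = "http://localhost:8000"  # in the real app this comes from ui.api_url

tab_summaries, tab_chat = st.tabs(["Document summaries", "Interactive chatbot"])

with tab_chat:
    question = st.text_input("Your question")
    if st.button("Ask") and question:
        resp = requests.post(
            f"{API_URL}/query", json={"question": question}, timeout=120
        ).json()
        st.markdown(resp["answer"])  # answer with [n] citations
        for chunk in resp["retrieved"]:
            st.caption(f'[{chunk["citation_id"]}] {chunk["document_title"]}')
```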


## Management and development

### Add or update documents

1. Add or update the `.md` files in the folder configured in `paths.input_path` (by default, `Docs/`).

2. Run the embedding generation script again:

   ```bash
   python -m scripts.generate_embeddings
   ```

3. Rebuild the FAISS index:

   ```bash
   python -m scripts.build_index
   ```

After these steps, the API and frontend will use the updated version of the document corpus.

### Configuration

The `configs/config.json` file is the single source of truth for system parameters. Some important fields:

- `embeddings.model_name` – embedding model.
- `paths.input_path` – location of the `.md` documents.
- `paths.embeddings_dir` and `paths.index_dir` – where to save embeddings and the index.
- `retrieve.top_k` – default number of retrieved chunks.
- `llm.provider`, `llm.model`, `llm.system_prompt` – LLM configuration.
- `ui.api_url` – URL used by Streamlit to call the API.
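A quick way to inspect the active settings from Python (key paths taken from the list above):

```python
import json

with open("configs/config.json", encoding="utf-8") as f:
    cfg = json.load(f)

print(cfg["embeddings"]["model_name"])  # e.g. intfloat/multilingual-e5-large
print(cfg["paths"]["input_path"])       # e.g. Docs/*.md
print(cfg["retrieve"]["top_k"])         # default number of retrieved chunks
print(cfg["llm"]["model"], cfg["ui"]["api_url"])
```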

Whenever you change paths or models, update this file and, if necessary, regenerate embeddings and the index.


## Details about the files

Summary of the main directories and scripts in the repository:

- `configs/`
  - `config.json` – general system parameters (paths, models, LLM, UI).
- `utils/`
  - `base_utils.py` – helper functions (load config, read files, etc.).
  - `retrieval_utils.py` – embedding generation, text splitting, and retrieval utilities.
- `scripts/`
  - `normalize_md_titles.py` – normalizes `.md` titles, inserting or adjusting the first `#` heading with a clean title.
  - `generate_embeddings.py` – generates embeddings and consolidated files in `data/embeddings/`.
  - `build_index.py` – creates the FAISS index in `data/index/` from the embeddings.
- `data/embeddings/`
  - `embeddings.npy` – embedding matrix (N × D).
  - `metadata.jsonl` – metadata for each chunk (index, document, fragment, content).
- `data/index/`
  - `faiss.index` – FAISS vector index.
  - `metadata.jsonl` – copy of the metadata for production use.
- `app/`
  - `api_server.py` – FastAPI service with the `/query` and `/list_documents` endpoints.
  - `app_front.py` – Streamlit application (in Portuguese) with the Summaries and Chatbot tabs.

With these components, the repository provides a complete pipeline to build and operate a RAG chatbot specialized in NORM documents.

Author: Breinner Espinosa.