|
|
---
title: beta-NORM
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
# beta-NORM: NORM Chatbot – Intelligent Query System
- [About](#about)
  - [Overall architecture](#overall-architecture)
- [Getting started](#getting-started)
  - [Environment setup and installation](#environment-setup-and-installation)
  - [Basic usage pipeline](#basic-usage-pipeline)
- [Management and development](#management-and-development)
  - [Add or update documents](#add-or-update-documents)
  - [Configuration](#configuration)
- [Details about the files](#details-about-the-files)
# About

The **NORM Chatbot** is a question-answering system based on
**Retrieval-Augmented Generation (RAG)**, designed for technical documents
(for example, standards and reports related to NORM).

The goal is to let users ask questions in natural language and receive
answers in Portuguese that always explicitly cite the passages and
documents used as their basis, with numbered references in the [n] format.
## Overall architecture

The system is composed of four main blocks:
1. **Data preparation**
   `.md` documents are stored in a folder (by default `Docs/`). A
   normalization step ensures that each file has a clean title in the
   form of a Markdown heading (`# Article title`), which is later used
   in the interface and in the references.

2. **Embeddings and indexing**
   - The texts are split into chunks with overlap, according to the
     configured splitter.
   - Each chunk is converted into an embedding vector using the model
     configured in `configs/config.json` (by default
     `intfloat/multilingual-e5-large`).
   - All vectors are consolidated in `embeddings.npy` and the
     corresponding metadata in `metadata.jsonl`, including
     `document_id`, `document_title`, `fragment_id` and the content of
     the chunk.
   - A vector index is created with **FAISS** and saved in `data/index`.

3. **RAG API (backend)**
   A **FastAPI** service exposes:
   - `POST /query` – receives a question, retrieves the most relevant
     chunks via FAISS and calls the LLM to generate the answer.
   - `GET /list_documents` – lists the documents present in the corpus.

4. **Web interface (frontend)**
   A **Streamlit** application consumes the API and offers two tabs:
   - **Document summaries** – generates summaries from the corpus.
   - **Interactive chatbot** – free Q&A over the documents.
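For illustration, a single record in `metadata.jsonl` could look like the following (the field names come from the description of block 2 above; the values are hypothetical):

```json
{"document_id": "doc_001", "document_title": "Article title", "fragment_id": 3, "content": "Text of the chunk..."}
```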
|
|
|
|
|
---

# Getting started

## Environment setup and installation

1. Create and activate a virtual environment (Python 3.10 or higher):

   ```bash
   python3 -m venv venv
   source venv/bin/activate     # Linux / macOS
   # .\venv\Scripts\activate    # Windows PowerShell
   ```
2. Install the dependencies from the `requirements.txt` at the root of
   the project:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. Make sure the `.md` documents are in the folder configured in
   `configs/config.json` (by default, `Docs/*.md`).
   If the files come from heterogeneous sources (titles on the first
   line, in the middle, in UPPERCASE, etc.), it is recommended to first
   run the title normalization script (see next section).
---

## Basic usage pipeline

Always run commands from the **root of the repository**.

### 1. Normalize `.md` titles (optional but recommended)

For large and heterogeneous corpora, it is important to ensure that
each `.md` file has a clean title in a `# ...` heading. The repository
includes a script that attempts to infer and standardize these titles
based on the file name and content:

```bash
python -m scripts.normalize_md_titles
```

This step adjusts the `.md` files in `Docs/`, inserting or fixing the
first `#` heading of each document, which is later displayed in the
document list in the frontend and used in the references.
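As a hypothetical example (the file contents are invented, and the exact title-inference rules are those implemented in `scripts/normalize_md_titles.py`), a document that starts with an uppercase line and no heading, such as:

```markdown
NORM IN PRODUCED WATER

Produced water often contains...
```

would end up with a clean first heading along the lines of:

```markdown
# NORM in produced water

Produced water often contains...
```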
|
|
|
|
|
### 2. Generate embeddings

```bash
python -m scripts.generate_embeddings
```

This script will:

- Read the `.md` files defined in `paths.input_path`.
- Split the text into chunks, according to the `splitter` defined in
  the config.
- Generate embeddings with the configured model.
- Save the consolidated artifacts in `data/embeddings/`:
  - `embeddings.npy`
  - `metadata.jsonl`
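The splitting step can be sketched roughly as follows. This is a minimal character-based illustration, not the project's actual splitter; `chunk_size` and `overlap` are hypothetical parameters:

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


document = "0123456789" * 120  # a 1200-character stand-in for a .md file
chunks = split_with_overlap(document)
print(len(chunks))                          # 3
print(chunks[0][-100:] == chunks[1][:100])  # True: the 100-char overlap
```

The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk, which improves retrieval quality.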
|
|
|
|
|
### 3. Build the FAISS index

```bash
python -m scripts.build_index
```

This script will:

- Read `data/embeddings/embeddings.npy`.
- Normalize the vectors.
- Build a `faiss.IndexFlatIP` index.
- Save the index to `data/index/faiss.index`.
- Copy `metadata.jsonl` to `data/index/metadata.jsonl`.
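The normalization step matters because `IndexFlatIP` scores by inner product; with L2-normalized vectors, the inner product equals cosine similarity. A minimal NumPy sketch of that step (FAISS itself is omitted here, and the array values are made up):

```python
import numpy as np

# Two made-up embedding vectors (rows), standing in for model output.
embeddings = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)

# L2-normalize each row; afterwards, inner product == cosine similarity,
# which is why an inner-product index like faiss.IndexFlatIP suffices.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms

print(np.linalg.norm(normalized, axis=1))    # every row now has unit length
print(float(normalized[0] @ normalized[1]))  # ~0.6, the cosine of the angle
```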
|
|
|
|
|
### 4. Run the API (backend)

Before starting the API, set the OpenAI key (if you are going to use the
real LLM):

```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"      # Linux / macOS
# $env:OPENAI_API_KEY = "YOUR_KEY_HERE"    # Windows PowerShell
```
Then run:

```bash
uvicorn app.api_server:app --reload --port 8000
```

Main endpoints:

- `GET /list_documents` – returns the list of indexed documents with
  `id` and `title` (derived from the `#` heading of each `.md`).
- `POST /query` – receives a JSON with the question (`question`), the
  number of chunks to retrieve (`top_k`) and, optionally, the model
  temperature (`temperature`), and returns:
  - `answer` – the answer text, already citing the sources in the [n]
    format.
  - `retrieved` – list of chunks used in the answer, containing
    `document_id`, `document_title`, `fragment_id`, `content`, and a
    numeric `citation_id` that corresponds to the [n] used in the
    text.
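For illustration, a `POST /query` request body built from the fields above might look like:

```json
{"question": "O que é NORM?", "top_k": 5, "temperature": 0.2}
```

and an abridged response could be shaped like this (all values invented):

```json
{
  "answer": "NORM refere-se a ... [1].",
  "retrieved": [
    {"document_id": "doc_001", "document_title": "Article title",
     "fragment_id": 3, "content": "...", "citation_id": 1}
  ]
}
```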
|
|
|
|
|
The backend builds a context with the retrieved chunks, assigns a citation
number per document (`citation_id`) and passes a reference table to the
LLM. The model is instructed, via the `system_prompt`, to cite only
these documents using the [n] format and **not to fabricate data**,
especially for years after 2025.
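The per-document numbering described above can be sketched as follows (a minimal illustration assuming chunks arrive in retrieval order; the function name is hypothetical):

```python
def assign_citation_ids(retrieved: list[dict]) -> list[dict]:
    """Give every distinct document one citation number, in order of first appearance."""
    citation_by_doc: dict[str, int] = {}
    for chunk in retrieved:
        doc = chunk["document_id"]
        if doc not in citation_by_doc:
            # First chunk from this document: allocate the next [n] number.
            citation_by_doc[doc] = len(citation_by_doc) + 1
        chunk["citation_id"] = citation_by_doc[doc]
    return retrieved


chunks = [{"document_id": "a"}, {"document_id": "b"}, {"document_id": "a"}]
print([c["citation_id"] for c in assign_citation_ids(chunks)])  # [1, 2, 1]
```

Numbering per document (rather than per chunk) is what lets several chunks from the same file share a single [n] reference in the answer.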
|
|
|
|
|
### 5. Run the Streamlit frontend

In another terminal (also at the project root and with the venv
activated):

```bash
streamlit run app/app_front.py
```

The interface will open in the default browser, usually at
`http://localhost:8501`.
In the **Document Summaries** tab, you can list the indexed documents,
click on a title (the text comes from the `#` heading of the `.md`) and
generate a short summary with coherent [n] references.

In the **Interactive Chatbot** tab, you can ask free-form questions
about the corpus; the answers also contain [n] citations, and the
“Source documents” and “Cited references” sections show only the
documents actually cited in the text.
---

# Management and development

## Add or update documents

1. Add or update the `.md` files in the folder configured in
   `paths.input_path` (by default, `Docs/`).
2. Run the embedding generation script again:

   ```bash
   python -m scripts.generate_embeddings
   ```

3. Rebuild the FAISS index:

   ```bash
   python -m scripts.build_index
   ```

After these steps, the API and frontend will use the updated version of
the document corpus.
## Configuration

The `configs/config.json` file is the **single source of truth** for
system parameters. Some important fields:

- `embeddings.model_name` – embedding model.
- `paths.input_path` – location of the `.md` documents.
- `paths.embeddings_dir` and `paths.index_dir` – where to save embeddings
  and the index.
- `retrieve.top_k` – default number of retrieved chunks.
- `llm.provider`, `llm.model`, `llm.system_prompt` – LLM configuration.
- `ui.api_url` – URL used by Streamlit to call the API.
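Putting the fields above together, the file might be shaped like this. The structure is inferred from the field names; all values are examples, not the project's defaults, except for the embedding model and document path stated earlier:

```json
{
  "paths": {
    "input_path": "Docs/*.md",
    "embeddings_dir": "data/embeddings",
    "index_dir": "data/index"
  },
  "embeddings": {
    "model_name": "intfloat/multilingual-e5-large"
  },
  "retrieve": {
    "top_k": 5
  },
  "llm": {
    "provider": "openai",
    "model": "...",
    "system_prompt": "..."
  },
  "ui": {
    "api_url": "http://localhost:8000"
  }
}
```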
|
|
|
|
|
Whenever you change paths or models, update this file and, if necessary,
regenerate embeddings and the index.

---
# Details about the files

Summary of the main directories and scripts in the repository:

- `configs/`
  - `config.json` – general system parameters (paths, models, LLM, UI).

- `utils/`
  - `base_utils.py` – helper functions (load config, read files, etc.).
  - `retrieval_utils.py` – embedding generation, text splitting and
    retrieval utilities.

- `scripts/`
  - `normalize_md_titles.py` – normalizes `.md` titles, inserting or
    adjusting the first `#` heading with a clean title.
  - `generate_embeddings.py` – generates embeddings and consolidated
    files in `data/embeddings/`.
  - `build_index.py` – creates the FAISS index in `data/index/` from
    the embeddings.

- `data/embeddings/`
  - `embeddings.npy` – embedding matrix (N x D).
  - `metadata.jsonl` – metadata for each chunk (index, document,
    fragment, content).

- `data/index/`
  - `faiss.index` – FAISS vector index.
  - `metadata.jsonl` – copy of the metadata for production use.

- `app/`
  - `api_server.py` – FastAPI service with the `/query` and
    `/list_documents` endpoints.
  - `app_front.py` – Streamlit application (Portuguese) with
    **Summaries** and **Chatbot** tabs.

With these components, the repository provides a complete pipeline to
build and operate a RAG chatbot specialized in NORM documents.

Author: Breinner Espinosa.