---
title: NLP RAG
emoji: 🏢
colorFrom: gray
colorTo: green
sdk: docker
pinned: false
license: mit
short_description: NLP Spring 2026 Project 1
---
# RAG-based Question-Answering System for Cognitive Behavioral Therapy (CBT)
## Overview
This project is a Retrieval-Augmented Generation (RAG) system built to answer CBT-related questions using grounded evidence from source manuals instead of relying on generic model knowledge. It combines hybrid retrieval, re-ranking, and strict response constraints so the assistant stays accurate, clinically focused, and less prone to hallucinations.
## Index
- [Overview](#overview)
- [Live Demo and Repository](#live-demo-and-repository)
- [Live Web Interface](#live-web-interface)
- [Tech Stack](#tech-stack)
- [System Architecture](#system-architecture)
- [Key Features](#key-features)
- [Installation and Setup](#installation-and-setup)
- [Configuration](#configuration)
- [Testing](#testing)
- [Running the Main Pipeline](#running-the-main-pipeline)
- [Contributors](#contributors)
## Live Demo and Repository
- Live Demo: https://rag-as-3-nlp.vercel.app/
- Code Repository: https://github.com/ramailkk/RAG-AS3-NLP
## Live Web Interface
The interactive chat interface is served from the live demo linked above, with the FastAPI backend hosted on Hugging Face Spaces.
## Tech Stack
- Frontend: Vercel (Node.js/React)
- Backend: Hugging Face Spaces (FastAPI)
- Vector Database: Pinecone
- Embeddings: jinaai/jina-embeddings-v2-small-en
- LLMs: Llama-3-8B (Primary), TinyAya, Mistral-7B, Qwen-2.5
- Re-ranking: Voyage AI (rerank-2.5) and Cross-Encoder (ms-marco-MiniLM-L-6-v2)
- Retrieval: Hybrid Search (Dense + BM25 Sparse)
## System Architecture
The system operates through a high-precision multi-stage pipeline to ensure clinical safety and data grounding:
- Hybrid Retrieval: Simultaneously queries dense vector indices for semantic intent and sparse BM25 indices for specific clinical terminology such as Socratic Questioning or Cognitive Distortions.
- Fusion & Re-ranking: Uses Reciprocal Rank Fusion (RRF) to merge results, followed by a Cross-Encoder stage to re-evaluate the relevance of chunks against the user query.
- Diversity Filtering (MMR): Implements Maximal Marginal Relevance to ensure the context provided to the LLM is not redundant.
- Prompt Engineering: Employs a specialized persona that acts as an empathetic CBT therapist with strict grounding constraints to prevent the use of outside knowledge.
- Automated Evaluation: An LLM-as-a-Judge framework calculates:
  - Faithfulness: Verifying claims against the source document.
  - Relevancy: Ensuring the answer directly addresses the user's query.
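The fusion step above can be sketched in a few lines. The snippet below is an illustrative implementation of Reciprocal Rank Fusion, not the project's actual code; `k=60` is the constant proposed in the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) across every list it appears
    in, so documents ranked well by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense-vector ranking with a BM25 ranking
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Here `doc_b` wins the fused ranking because it places highly in both lists, which is exactly the behavior that makes RRF a robust way to combine dense and sparse retrievers without score normalization.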
## Key Features
- Clinical Domain Focus: Optimized for high-density information found in mental health manuals.
- Zero Tolerance for Hallucinations: Includes a fallback protocol to state when information is missing rather than inventing therapeutic advice.
- Advanced Chunking: Uses sentence-level and recursive character splitting to preserve the logical flow of therapeutic guidelines and patient transcripts.
- Multi-Model Support: Tested across multiple LLMs to find the best balance between latency and grounding.
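To illustrate the recursive character splitting mentioned above, here is a minimal self-contained sketch of the idea; the separator hierarchy and chunk size are illustrative defaults, and the project may rely on a library implementation instead:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Greedily pack pieces into chunks of at most max_len characters,
    breaking at the coarsest separator (paragraph, line, sentence, word)
    that actually appears in the text."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator absent; try a finer one
        chunks, current = [], ""
        for part in parts:
            if len(part) > max_len:
                # flush what we have, then split the oversized piece further
                if current:
                    chunks.append(current)
                    current = ""
                chunks.extend(recursive_split(part, max_len, separators))
            elif current and len(current) + len(sep) + len(part) <= max_len:
                current = current + sep + part
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        return chunks
    # no separator present at all: hard-split by character count
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Preferring paragraph and sentence boundaries over arbitrary character cuts is what preserves the logical flow of therapeutic guidelines inside each chunk.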
## Installation and Setup
### Backend Setup
The backend handles document processing, Pinecone vector operations, and the hybrid retrieval logic.
1. Initialize Virtual Environment:
```bash
python -m venv .venv
# Windows (PowerShell)
.venv\Scripts\Activate.ps1
# Windows (Git Bash)
source .venv/Scripts/activate
# Linux/macOS
source .venv/bin/activate
```
2. Install Dependencies:
```bash
pip install -r requirements.txt
```
3. Launch API Server:
```bash
uvicorn backend.api:app --reload --host 0.0.0.0 --port 8000
```
### Frontend Setup
The frontend provides the interactive chat interface and real-time evaluation scores.
1. Navigate and Install:
```bash
cd frontend
npm install
```
2. Start Development Server:
```bash
npm run dev
```
## Configuration
To replicate the system, ensure your environment variables contain valid API keys for:
- Pinecone for vector storage
- OpenRouter or Hugging Face Inference API for LLM access
- Voyage AI for re-ranking
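The exact variable names depend on the code, but a typical `.env` for this stack might look like the following (the names here are illustrative placeholders):

```bash
PINECONE_API_KEY="your-pinecone-key"
OPENROUTER_API_KEY="your-openrouter-key"  # or HF_TOKEN for the HF Inference API
VOYAGE_API_KEY="your-voyage-key"
```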
## Testing
Run `test.py` to benchmark the chunking strategies and retrieval configurations, then generate a complete Markdown report of the results.
```bash
python test.py
```
This script evaluates multiple test queries across the configured chunking techniques and retrieval strategies, then writes the full output to `retrieval_report.md`. Use that report to choose the best chunking strategy and retrieval configuration.
### Key variables you can change in `test.py`
- `test_queries`: the questions used for benchmarking.
- `CHUNKING_TECHNIQUES_FILTERED`: the chunking strategies included in the report.
- `RETRIEVAL_STRATEGIES`: the retrieval modes and MMR settings being compared.
- `index_name`: the Pinecone index that stores the chunked data.
- `top_k` and `final_k`: how many candidates are retrieved and how many are kept in the final context.
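For example, to benchmark your own questions you would edit the top of `test.py` along these lines (the values shown are illustrative, not the script's defaults):

```python
# Queries used for benchmarking (edit to suit your corpus)
test_queries = [
    "What is Socratic questioning and when is it used?",
    "List common cognitive distortions with examples.",
]

top_k = 20   # candidates retrieved per strategy before re-ranking
final_k = 5  # chunks kept in the final context
```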
## Running the Main Pipeline
After testing, run `main.py` to reproduce the main experiment with the selected configuration and evaluate faithfulness and relevancy across the model set. Because changing its configuration reruns the same evaluation under different chunking, retrieval, and model settings, this script is the core of the reproducibility workflow.
```bash
python main.py
```
This step runs the end-to-end comparison flow for all models, measures faithfulness and relevancy for each one, and writes the detailed findings to `rag_ablation_findings.md`.
### Key variables you can change in `main.py`
- `CHUNKING_TECHNIQUES` or the technique filter used in the script: controls which chunking methods are evaluated.
- `test_queries`: the query set used for the ablation study.
- `MODEL_MAP`: the model lineup being compared.
- `retrieval_strategy`: the retrieval mode, MMR setting, and label for each run.
- `top_k` and `final_k`: candidate retrieval depth and final context size.
- `temperature` in `cfg.gen`: generation randomness for the model outputs.
- `output_file`: the markdown report written by the run, usually `rag_ablation_findings.md`.
## Contributors
- Ramail Khan ([ramailkk](https://github.com/ramailkk))
- Qamar Raza ([Qar-Raz](https://github.com/Qar-Raz))
- Muddasir Javed ([bsparx](https://github.com/bsparx))