# VGEC RAG Chatbot: Codebase Documentation
> **Generated:** 2026-03-25
> **Version:** 1.0.0
> **Scope:** Full system: ingestion, retrieval, classification, API, evaluation
---
## Table of Contents
1. [Project Overview](#1-project-overview)
2. [System Architecture](#2-system-architecture)
3. [Schema & Data Model](#3-schema--data-model)
4. [Retrieval Pipeline](#4-retrieval-pipeline)
5. [Key Classes & Modules](#5-key-classes--modules)
6. [Evaluation & Metrics](#6-evaluation--metrics)
7. [Known Limitations](#7-known-limitations)
8. [File Structure](#8-file-structure)
---
## 1. Project Overview
### Purpose
**VGEC RAG Chatbot** is a Retrieval-Augmented Generation (RAG) chatbot for **Vishwakarma Government Engineering College (VGEC), Chandkheda, Gujarat**. It lets students, faculty, and visitors query structured information about the institution (departments, faculty, syllabus, labs, intake capacity, and more) in natural language.
### Domain
- **Institution:** VGEC (Government Engineering College, Gujarat)
- **Data Coverage:** Department-level information for multiple disciplines (Computer Engineering, Civil, Electrical, IT, ECE, etc.)
- **Topics:** Faculty lists, lab facilities, syllabus details, HOD info, research activities, intake capacity, achievements
### Tech Stack
| Layer | Technology |
|---|---|
| **API Framework** | FastAPI |
| **Vector Database** | ChromaDB (persistent, local) |
| **Embeddings** | Google `gemini-embedding-001` (via `langchain-google-genai`) |
| **LLM (Cloud)** | Google Gemini `gemini-2.5-flash-lite` |
| **LLM (Local)** | `EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf` via `llama-cpp-python` |
| **NLP / Preprocessing** | spaCy (`en_core_web_sm`), NLTK (PorterStemmer) |
| **Classifier** | Scikit-learn `LogisticRegression` + `SentenceTransformer` (`MongoDB/mdbr-leaf-mt`) |
| **BM25** | `langchain-community` `BM25Retriever` |
| **Chunking** | LangChain `RecursiveCharacterTextSplitter` |
| **Config** | Pydantic `BaseSettings` (`.env`-backed) |
### Key Features Implemented
- ✅ Structured JSON ingestion with intent-aware chunking
- ✅ Hybrid retrieval: BM25 + vector search fused via Reciprocal Rank Fusion (RRF)
- ✅ Intent/metadata classification with confidence-gated ChromaDB filters
- ✅ Abbreviation expansion (`CE` → `Computer Engineering`, etc.)
- ✅ Multi-turn conversation history support
- ✅ Dual LLM backend with automatic fallback (Gemini ↔ Local)
- ✅ Full CRUD REST API for vector store management
- ✅ Offline evaluation endpoint (MRR, hit rate, noise rate)
- ✅ Classifier accuracy evaluation endpoint
---
## 2. System Architecture
### Component Diagram
```
          ┌───────────────────────┐
          │      FastAPI App      │
          │ /api/v1/rag  /vector  │
          └───────────┬───────────┘
                      │  DI (lru_cache)
          ┌───────────▼───────────┐
          │      RAGService       │
          │  (core orchestrator)  │
          └─────┬──────────┬──────┘
                │          │
┌───────────────▼──┐   ┌───▼────────────────────┐
│ IngestionService │   │ HybridRetrievalService │
│   (write path)   │   │      (read path)       │
└───────┬──────────┘   └───┬──────────────────┬─┘
        │                  │                  │
┌───────▼───────┐  ┌───────▼───────┐  ┌───────▼───────┐
│  FileService  │  │ ClassifierSvc │  │  VectorStore  │
│ (file + meta) │  │ (clf predict) │  │  (ChromaDB)   │
└───────────────┘  └───────────────┘  └───────────────┘
```
### Data Flow
#### Ingestion Path
```
File Upload (PDF/MD/TXT/JSON)
    │
    ▼
FileService.read_file()              ← type-aware loading (PyMuPDF for PDF)
    │                                  returns: Document + metadata
    ▼
FileService.write_file()             ← persist copy to data/documents/
    │
    ▼
IngestionService.handle_*_docs()     ← route by file extension
    │
    ├─ JSON → handle_json_docs()     ← intent-aware chunks (list / detail / count)
    └─ text → handle_text_docs()     ← RecursiveCharacterTextSplitter + normalize()
    │
    ▼
VectorStore.add_documents()          ← embed + upsert into ChromaDB
    │
    ▼
FileService.patch_metadata()         ← update ingestion record JSON (chunk count, timing, size)
```
#### Query Path
```
User Question
    │
    ▼
preprocess_query()                   ← currently normalize() only (tokenize + strip blanks)
    │
    ▼
HybridRetrievalService.retrieve()
    │
    ├─ clf.expand_abbreviations()    ← CE → Computer Engineering
    ├─ clf.predict_with_filter()     ← LogReg predict → Chroma $and/$or filter
    ├─ _vector_rank()                ← ChromaDB similarity_search_with_score (k=15)
    ├─ _bm25_rank()                  ← BM25 over the vector candidate pool
    ├─ _reciprocal_rank_fusion()     ← weighted RRF merge
    ├─ metadata score boosting       ← multiply fused scores for confident matches
    └─ _apply_title_boost()          ← per-query-word title match bonus
    │
    ▼
get_references_v2()                  ← filter by threshold, build context string
    │
    ▼
LLM.invoke(prompt)                   ← Gemini or local LlamaCpp
    │
    ▼
Return: { answer, references, context, threshold_used, k_used }
```
### External Dependencies
| Dependency | Role | Provider |
|---|---|---|
| ChromaDB | Persistent vector store | Local disk |
| Google Gemini API | Embeddings + LLM generation | Google Cloud |
| LlamaCpp (GGUF model) | Local LLM fallback | Local CPU |
| Sentence Transformers | Classifier feature extraction | HuggingFace Hub |
| spaCy `en_core_web_sm` | POS tagging / lemmatization | Local |
---
## 3. Schema & Data Model
### Source JSON Format
Source data files (e.g. `computer_eng.json`) follow this schema:
```json
{
  "id": "computer-engineering-department",
  "name": "Computer Engineering Department",
  "source": "https://www.vgecg.ac.in/department.php?dept=3",
  "category": "computer_eng",
  "type": "department",
  "created_date": "2026-02-19",
  "content": {
    "<topic_key>": {
      "list": ["item 1", "item 2", "..."],
      "details": "Paragraph describing the topic."
    }
  }
}
```
**Top-level fields:**
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique document identifier |
| `name` | string | Human-readable institution/department name |
| `source` | string | Authoritative URL |
| `category` | string | Department slug (e.g. `computer_eng`) |
| `type` | string | Document type (e.g. `department`) |
| `created_date` | string (ISO) | Data creation date |
| `content` | object | Topic map; each key = a topic |
### Chunk Metadata Schema (stored in ChromaDB)
Every vector chunk stored in Chroma carries the following metadata:
| Field | Type | Source |
|---|---|---|
| `id` | string (UUID) | Auto-generated |
| `title` | string | Document name / topic key |
| `source` | string | Source URL |
| `source_file` | string | Filename (e.g. `computer_eng.json`) |
| `type` | string | Taxonomy level 1 (e.g. `department`) |
| `category` | string | Taxonomy level 2 (e.g. `computer_eng`) |
| `topic` | string | Taxonomy level 3 (e.g. `faculty`) |
| `intent` | string | Chunk intent: `list`, `detail`, or `count` |
| `chunk_index` | int | Sequential index within file |
| `created_date` | string (ISO) | Ingestion timestamp |
| `updated_at` | string (ISO) | Last modification timestamp |
| `ext` | string | Source file extension (`json`, `pdf`, `md`, `txt`) |
### Hierarchical Taxonomy
The classifier predicts, and the ChromaDB filters operate on, a three-level taxonomy (`type`, `category`, `topic`); each chunk additionally carries an `intent`:
```
type
└── category
    └── topic
        └── intent (list | detail | count)
```
**Example mapping (Computer Engineering):**
```
type: "department"
└── category: "computer_eng"
    ├── topic: "faculty"      → intent: list | detail
    ├── topic: "lab"          → intent: list | detail
    ├── topic: "syllabus"     → intent: list | detail
    ├── topic: "hod"          → intent: list | detail
    ├── topic: "intake"       → intent: list | detail
    ├── topic: "research"     → intent: list | detail
    └── topic: "achievements"
```
### Document Chunking Strategy
**JSON documents** use a hand-crafted, intent-aware strategy in `IngestionService.handle_json_docs()`:
| Intent | Chunk Content | Metadata |
|---|---|---|
| `list` | Numbered list: `1. item\n2. item\n...` | `intent=list` |
| `count` | `"Total <topic>: N"` (auto-generated) | `intent=count` |
| `detail` | Raw paragraph text | `intent=detail` |
**Text/PDF/Markdown documents** use `RecursiveCharacterTextSplitter`:
- Default: `chunk_size=500`, `chunk_overlap=100`
- Separator priority: `\n\n` → `\n` → ` ` → (character)
- Markdown variant respects `---` section delimiters
- Content is passed through `normalize()` (tokenize + strip blanks) before storage
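
The intent-aware strategy for JSON documents can be sketched as follows. This is a minimal illustration, not the code in `IngestionService.handle_json_docs()`: the function name `chunk_topic`, the plain-dict chunk shape, and the choice to emit a `count` chunk only when the list is non-empty are all assumptions.

```python
from typing import Any


def chunk_topic(topic: str, payload: dict[str, Any]) -> list[dict[str, str]]:
    """Build list / count / detail chunks for one topic of a source JSON file.

    Hypothetical helper: the real service attaches far more metadata.
    """
    chunks: list[dict[str, str]] = []

    items = payload.get("list") or []
    if items:
        # intent=list: a numbered list, one item per line
        numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
        chunks.append({"text": numbered, "topic": topic, "intent": "list"})
        # intent=count: synthetic chunk derived from the list length
        chunks.append({"text": f"Total {topic}: {len(items)}", "topic": topic, "intent": "count"})

    details = payload.get("details")
    if details:
        # intent=detail: the raw paragraph text
        chunks.append({"text": details, "topic": topic, "intent": "detail"})

    return chunks


example = chunk_topic(
    "faculty",
    {"list": ["Prof. A", "Prof. B"], "details": "The department has two faculty members."},
)
```

Because the count chunk is derived from the list, it has to be regenerated whenever the list changes.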
---
## 4. Retrieval Pipeline
### Query Processing Flow
```python
# Step 1: Normalize input
query = preprocess_query(question)
# → currently normalize() only (tokenize + strip blanks); no stopword removal or lemmatization
# Step 2: Expand abbreviations
processed_query = clf.expand_abbreviations(query)
# → "CE dept" → "computer engineering department"
# Step 3: Classify intent/metadata
filters = clf.predict_with_filter([processed_query])
# → {"$and": [{"type": "department"}, {"intent": "list"}, {"$or": [...]}]}
# Step 4: Vector search with optional filter
raw_results = chroma.similarity_search_with_score(query, k=15, filter=filters)
# Fallback: if filtered results empty, retry without filter
# Step 5: BM25 re-rank over vector candidates (retriever is rebuilt on every request)
bm25_retriever = BM25Retriever.from_documents(candidate_docs)
# Step 6: RRF fusion
# fused_score(d) = bm25_weight   * 1 / (rrf_k + rank_bm25(d))
#                + vector_weight * 1 / (rrf_k + rank_vec(d))
# Step 7: Metadata confidence boosting
if doc.metadata[field] == predicted_val and conf > 0.90:
    result.fused_score *= boost_factor  # 1.10–1.20
# Step 8: Title word boost
for word in query_words:
    if word in doc.title:
        result.fused_score += title_boost_per_word  # 0.004
# Step 9: Threshold filter + sort + top-k
results = [r for r in results if r.fused_score >= threshold]
```
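
Step 6 can be made concrete with a small, self-contained sketch of weighted RRF. It is a sketch under stated assumptions, not the code in `_reciprocal_rank_fusion()`: the function name `weighted_rrf`, the use of bare document IDs, and the rule that a document missing from one ranking contributes nothing for that ranking are all assumed. The default weights and `rrf_k` are the `hybrid_query` defaults documented in this section.

```python
def weighted_rrf(bm25_ids, vector_ids, bm25_weight=0.45, vector_weight=0.55, rrf_k=20):
    """Fuse two ranked ID lists with weighted Reciprocal Rank Fusion.

    Each list is ordered best-first; ranks are 1-based.
    Returns (doc_id, fused_score) pairs sorted by descending score.
    """
    scores = {}
    for weight, ranking in ((bm25_weight, bm25_ids), (vector_weight, vector_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# "b" is 2nd for BM25 and 1st for vector search; "a" is the reverse.
fused = weighted_rrf(["a", "b", "c"], ["b", "a", "d"])
```

With these defaults a raw fused score cannot exceed (0.45 + 0.55) / 21 ≈ 0.048 before boosting, so the sketch returns raw scores and leaves out any rescaling the service may apply before comparing against `score_threshold`.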
### Classifier Thresholds
The `Classifier` uses two separate threshold tables:
**Prediction threshold:** below this, the field is set to `None` (not used at all):
| Field | Threshold |
|---|---|
| `type` | 0.40 |
| `category` | 0.40 |
| `topic` | 0.50 |
| `intent` | 0.60 |
**Filter threshold:** at or above this, the field's predicted value is used in the ChromaDB filter (`type` as a hard `$and` anchor, `category` and `topic` as soft `$or` hints):
| Field | Threshold |
|---|---|
| `type` | 0.65 |
| `category` | 0.65 |
| `topic` | 0.70 |
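
To show how the two tables interact, here is a minimal sketch that applies both to one raw prediction. The helper names `gate_labels` and `filter_eligible` and the `(label, confidence)` input shape are hypothetical; only the threshold values come from the tables above.

```python
from typing import Optional

PREDICTION_THRESHOLDS = {"type": 0.40, "category": 0.40, "topic": 0.50, "intent": 0.60}
FILTER_THRESHOLDS = {"type": 0.65, "category": 0.65, "topic": 0.70}


def gate_labels(raw: dict[str, tuple[str, float]]) -> dict[str, Optional[str]]:
    """Keep a predicted label only if its confidence reaches the prediction threshold."""
    return {
        field: (label if conf >= PREDICTION_THRESHOLDS[field] else None)
        for field, (label, conf) in raw.items()
    }


def filter_eligible(raw: dict[str, tuple[str, float]]) -> list[str]:
    """List the fields whose confidence reaches the (stricter) filter threshold."""
    return [
        field
        for field, (_, conf) in raw.items()
        if field in FILTER_THRESHOLDS and conf >= FILTER_THRESHOLDS[field]
    ]


raw_prediction = {
    "type": ("department", 0.92),
    "category": ("computer_eng", 0.55),
    "topic": ("faculty", 0.45),
    "intent": ("list", 0.75),
}
labels = gate_labels(raw_prediction)
eligible = filter_eligible(raw_prediction)
```

In this example `category` survives as a prediction (0.55 ≥ 0.40) but is too uncertain to be used in a filter (0.55 < 0.65), while `topic` is dropped entirely.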
### Filter Construction Logic (`_build_filter`)
```python
# Gate: if type confidence < 0.65 → return None (full scan)
# Hard anchors (always included if type passes):
#   - type == predicted_type
#   - intent == predicted_intent (special: "count" expands to count OR detail)
# Soft hints (combined as $or):
#   - category == predicted_category (if conf >= 0.65, else "general")
#   - topic == predicted_topic (if conf >= 0.70, else "general")
```
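
The rules above translate into a filter builder along these lines. This is a sketch, not the actual `_build_filter`: the input dict shape (`<field>` and `<field>_conf` keys) follows the `predict()` output described in this document, while omitting the intent clause when `intent` is `None` is an assumption the rules do not cover.

```python
def build_filter(pred):
    """Build a Chroma-style `where` filter from one classifier prediction.

    Returns None when the type prediction is too uncertain (full scan).
    """
    # Gate: no filter at all unless the type is confidently predicted.
    if pred.get("type") is None or pred["type_conf"] < 0.65:
        return None

    # Hard anchors: type, plus intent when one was predicted (assumption).
    clauses = [{"type": pred["type"]}]
    intent = pred.get("intent")
    if intent == "count":
        # Count questions may also be answered by detail chunks.
        clauses.append({"$or": [{"intent": "count"}, {"intent": "detail"}]})
    elif intent is not None:
        clauses.append({"intent": intent})

    # Soft hints: fall back to "general" when confidence is too low.
    category = pred["category"] if pred["category_conf"] >= 0.65 else "general"
    topic = pred["topic"] if pred["topic_conf"] >= 0.70 else "general"
    clauses.append({"$or": [{"category": category}, {"topic": topic}]})

    return {"$and": clauses}


confident = build_filter({
    "type": "department", "type_conf": 0.90,
    "category": "computer_eng", "category_conf": 0.80,
    "topic": "faculty", "topic_conf": 0.60,
    "intent": "count", "intent_conf": 0.70,
})
uncertain = build_filter({
    "type": "department", "type_conf": 0.50,
    "category": "computer_eng", "category_conf": 0.80,
    "topic": "faculty", "topic_conf": 0.90,
    "intent": "list", "intent_conf": 0.70,
})
```

Combining category and topic under `$or` means a chunk matching either hint passes, which keeps the filter from emptying the candidate pool when one of the two predictions is wrong.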
### Hybrid Retrieval Config (Defaults)
| Parameter | `hybrid_query` | `search_docs` |
|---|---|---|
| `candidate_k` | 15 | 15 |
| `top_k` (final) | `settings.similarity_top_k` (8) | k (param) |
| `bm25_weight` | 0.45 | 0.70 |
| `vector_weight` | 0.55 | 0.30 |
| `rrf_k` | 20 | 20 |
| `bm25_k1` | 1.2 | 1.5 |
| `bm25_b` | 0.9 | 0.75 |
| `title_boost_per_word` | 0.004 | 0.004 |
| `score_threshold` | 0.4 | 0.4 |
> **Note:** `search_docs` is BM25-heavy (0.70) since it is used for keyword-oriented document browsing, while `hybrid_query` leans slightly toward vector search (0.55) for semantic QA.
---
## 5. Key Classes & Modules
### Services (`app/services/`)
#### `RAGService`
Main orchestrator. Singleton via `lru_cache` in `dependencies.py`.
| Method | Description |
|---|---|
| `query()` | Semantic-only QA (vector search → LLM) |
| `hybrid_query()` | Hybrid QA (BM25 + vector → RRF → LLM) |
| `search_docs()` | BM25-heavy document search, no LLM |
| `ingest_documents()` | Ingest a file path into the vector store |
| `get_filenames()` | Return all tracked file metadata records |
| `test_queries()` | Batch retrieval evaluation (MRR, precision, noise) |
| `test_classifier()` | Batch classifier accuracy evaluation |
| `delete_database()` | Drop the entire ChromaDB collection |
#### `HybridRetrievalService`
Stateless per-request service created inline by `RAGService`.
| Method | Description |
|---|---|
| `retrieve(query)` | Full hybrid retrieval pipeline; returns `List[RetrievalResult]` |
| `_vector_rank()` | Chroma similarity search + classifier filter |
| `_bm25_rank()` | BM25 over candidate pool |
| `_reciprocal_rank_fusion()` | Merge both ranked lists via RRF |
| `_apply_title_boost()` | Word-level title match score bonus |
**`RetrievalResult` dataclass:**
```python
@dataclass
class RetrievalResult:
    document: Document
    fused_score: float
    bm25_rank: Optional[int]
    vector_rank: Optional[int]
    title_boost: float
```
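
As an illustration of how `title_boost` might be filled in, the sketch below applies the per-word title bonus to a list of results. It is not the code in `_apply_title_boost()`: `Doc` is a hypothetical stand-in for LangChain's `Document`, the field defaults are added for brevity, and case-insensitive substring matching is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: dict


@dataclass
class RetrievalResult:
    document: Doc
    fused_score: float
    bm25_rank: Optional[int] = None
    vector_rank: Optional[int] = None
    title_boost: float = 0.0


def apply_title_boost(results, query, per_word=0.004):
    """Add `per_word` to the fused score for each query word found in the title."""
    words = query.lower().split()
    for result in results:
        title = str(result.document.metadata.get("title", "")).lower()
        result.title_boost = sum(per_word for word in words if word in title)
        result.fused_score += result.title_boost
    # Best score first
    return sorted(results, key=lambda r: r.fused_score, reverse=True)


candidates = [
    RetrievalResult(Doc("...", {"title": "Faculty List"}), fused_score=0.040),
    RetrievalResult(Doc("...", {"title": "Lab Facilities"}), fused_score=0.042),
]
ranked = apply_title_boost(candidates, "computer faculty list")
```

Here two query words match the first title, so its bonus of 0.008 lifts it above a candidate that started with a slightly higher fused score.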
#### `Classifier`
Loaded at startup from a pickled pipeline (`chatbot_classifier.pkl`).
| Method | Description |
|---|---|
| `predict(queries)` | Returns list of `{type, category, topic, intent, *_conf}` dicts |
| `predict_with_filter(queries)` | Returns a ChromaDB-compatible filter dict or `None` |
| `expand_abbreviations(text)` | Regex-based abbreviation expansion |
| `get_features(queries)` | Build `[SentenceTransformer embedding | TF-IDF]` feature matrix |
| `train_models(df)` | Train 4 LogisticRegression classifiers (offline use) |
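
Regex-based abbreviation expansion, as performed by `expand_abbreviations(text)`, can be sketched as below. The mapping is a small hypothetical sample (the project keeps its own mappings in `constants.py`), and lowercasing the replacement is an assumption based on the `"CE dept"` example in this document.

```python
import re

# Hypothetical sample; keys are lowercase abbreviations.
ABBREVIATIONS = {
    "ce": "computer engineering",
    "dept": "department",
    "hod": "head of department",
}

# \b anchors ensure only whole words match, so "science" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expansion."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(0).lower()], text)


expanded = expand_abbreviations("CE dept")
```

Compiling one alternation pattern up front avoids running a separate substitution per abbreviation on every query.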
#### `IngestionService`
| Method | Description |
|---|---|
| `ingest(file_path)` | Load + chunk a file; returns `List[Document]` |
| `handle_json_docs()` | Intent-aware chunking for structured JSON data |
| `handle_text_docs()` | Recursive character splitting for unstructured text |
| `get_records()` | Delegate to `FileService.get_records()` |
| `delete_record(filename)` | Remove a file's metadata record |
| `path_record(path, metadata)` | Patch ingestion stats after indexing |
#### `FileService`
| Method | Description |
|---|---|
| `read_file(path)` | Load file content; dispatches by extension |
| `write_file(path, content, metadata)` | Persist file to `data/documents/` |
| `patch_metadata(path, metadata)` | Merge new fields into existing record |
| `get_records()` | Return all ingestion records dict |
| `delete_record(filename)` | Remove a record from `<collection>.json` |
#### `VectorStore`
Thin wrapper around `langchain_chroma.Chroma`.
| Method | Description |
|---|---|
| `get()` | Retrieve all documents |
| `get_by_id(ids)` | Retrieve specific documents by ID |
| `add_documents(docs)` | Embed + insert, skipping empty chunks |
| `update_document(id, doc)` | Delete then re-insert with same ID |
| `delete(ids)` | Remove documents by ID list |
| `similarity_search_with_score()` | Wrapped Chroma search |
### Utilities (`app/utils/`)
#### `preprocessing.py`
| Function | Description |
|---|---|
| `preprocess(text)` | spaCy POS filter + lemmatize + stopword removal → joined string |
| `normalize(text)` | Tokenize + strip blanks (lightweight, no POS) |
| `preprocess_query(query)` | Applies `normalize()` to user queries |
| `preprocess_documents(docs)` | Applies `preprocess()` to a document list in-place |
| `preprocess_filename(path)` | Sanitize filename (remove special chars, lowercase) |
#### `document_helpers.py`
| Function | Description |
|---|---|
| `get_references_v2(docs, threshold)` | Convert `RetrievalResult` list → references dict + context string |
| `get_references(docs, threshold)` | Same for raw `(Document, distance)` tuples (used by `query()`) |
| `build_metadata(path)` | Parse YAML frontmatter from `.md`/`.txt` files |
| `create_documents(chunks, ...)` | Attach standard metadata (UUID, timestamps, indices) to chunks |
| `create_documents_from_text(text)` | Full pipeline: frontmatter parse → split → metadata attach |
| `clean_metadata(metadata)` | Serialize datetime, coerce non-allowed types to string |
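
The behaviour described for `clean_metadata(metadata)` can be illustrated with a short sketch. It assumes the allowed value types are `str`, `int`, `float`, and `bool`; the actual helper may accept a different set or handle `None` specially.

```python
from datetime import date, datetime

# Assumed set of value types that can be stored as-is.
ALLOWED_TYPES = (str, int, float, bool)


def clean_metadata(metadata: dict) -> dict:
    """Return a copy whose values are all scalars of an allowed type."""
    cleaned = {}
    for key, value in metadata.items():
        if isinstance(value, (datetime, date)):
            cleaned[key] = value.isoformat()   # serialize timestamps as ISO strings
        elif isinstance(value, ALLOWED_TYPES):
            cleaned[key] = value               # already storable
        else:
            cleaned[key] = str(value)          # coerce lists, dicts, etc.
    return cleaned


cleaned = clean_metadata({
    "title": "Faculty",
    "chunk_index": 3,
    "created_date": datetime(2026, 2, 19, 10, 30),
    "tags": ["a", "b"],
})
```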
#### `model_factory.py`
| Function | Description |
|---|---|
| `get_embedding_model()` | Returns `GoogleGenerativeAIEmbeddings` |
| `get_gemini_model()` | Returns `ChatGoogleGenerativeAI` |
| `get_local_model()` | Returns `ChatLlamaCpp` (GGUF, CPU inference) |
| `get_llm_model(provider)` | Dispatches to Gemini or Local with fallback logic |
### API Routes (`app/api/routes/`)
#### `rag.py` (prefix `/api/v1/rag`)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check |
| POST | `/` | Semantic query |
| POST | `/hybrid_query` | Hybrid RAG query (primary endpoint) |
| POST | `/similarity_search` | Hybrid retrieval, no LLM response |
| POST | `/search` | BM25-heavy document search |
| POST | `/test` | Batch retrieval evaluation |
| POST | `/test_classifier` | Classifier accuracy evaluation |
| GET | `/test_classifier_dataset` | Run built-in test dataset, cache result |
#### `vector_store.py` (prefix `/api/v1/vector`)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | List all documents (paginated, filterable) |
| GET | `/filenames` | List ingested file records |
| GET | `/{id}` | Get single document by ChromaDB ID |
| POST | `/` | Upload + ingest file |
| PUT | `/{id}` | Update document content/metadata |
| DELETE | `/ids` | Bulk delete by ID list |
| DELETE | `/{id}` | Delete single document |
| DELETE | `/` | Filter-based delete (filename/source/contains) |
### Configuration (`app/core/config.py`)
All settings are read from `.env` via Pydantic `BaseSettings`:
```python
class Settings(BaseSettings):
    # Paths
    collection_name: str = "classifier_test_1"
    persist_directory: str = "./data/vector_stores/classifier_test_1"
    # Chunking
    chunk_size: int = 500
    chunk_overlap: int = 100
    # Retrieval
    similarity_top_k: int = 8
    similarity_threshold: float = 0.4
    # LLM Provider
    llm_provider: Literal["gemini", "local"] = "local"
    enable_fallback: bool = True
    # Models
    embedding_model_name: str = "models/gemini-embedding-001"
    gemini_model_name: str = "gemini-2.5-flash-lite"
    local_model_name: str = "EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf"
    # Generation
    max_output_tokens: int = 2048
    local_max_tokens: int = 512
    # Auth
    google_api_key: str  # required; must be in .env
```
---
## 6. Evaluation & Metrics
### Retrieval Evaluation (`test_queries` / `POST /api/v1/rag/test`)
Tests each (question, expected_document, expected_chunk_index) triple against `hybrid_query`:
| Metric | Formula | Interpretation |
|---|---|---|
| **Hit Rate** | `hits / total` | % of questions where the exact chunk was retrieved |
| **Top-1 Hit Rate** | `rank==1 hits / total` | % of questions where exact chunk was top result |
| **MRR** | `mean(1/rank)` | Mean Reciprocal Rank; higher = correct result ranked earlier |
| **Doc Precision** | `correct_source_chunks / all_chunks` | How many retrieved chunks came from the right document |
| **Doc Recall** | `1 if any correct_source_chunk else 0` | Did we retrieve at least one chunk from the right document? |
| **Doc Noise** | `wrong_source_chunks / all_chunks` | Proportion of off-topic chunks in the result set |
| **Error Rate** | `1 - hit_rate` | Miss rate for exact chunk retrieval |
**Test Input Schema:**
```python
class TestRequestSchema(BaseModel):
    tests: List[Test]  # question + document + chunk_index
    k: int = 5
    threshold: float = 0.4
```
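
The formulas in the metrics table can be worked through with a small sketch. The function names and input shapes are hypothetical, and counting a miss as a reciprocal rank of 0 in the MRR is an assumption (the usual convention), not something this document states.

```python
def retrieval_metrics(ranks):
    """Chunk-level metrics over a batch of questions.

    `ranks` holds the 1-based rank of the expected chunk per question,
    or None when the chunk was not retrieved.
    """
    if not ranks:
        raise ValueError("at least one test question is required")
    total = len(ranks)
    hits = [rank for rank in ranks if rank is not None]
    hit_rate = len(hits) / total
    return {
        "hit_rate": hit_rate,
        "top1_hit_rate": sum(1 for rank in hits if rank == 1) / total,
        "mrr": sum(1.0 / rank for rank in hits) / total,  # misses contribute 0
        "error_rate": 1 - hit_rate,
    }


def doc_metrics(retrieved_sources, expected_source):
    """Document-level precision, recall, and noise for one question."""
    total = len(retrieved_sources)
    correct = sum(1 for source in retrieved_sources if source == expected_source)
    return {
        "precision": correct / total if total else 0.0,
        "recall": 1.0 if correct else 0.0,
        "noise": (total - correct) / total if total else 0.0,
    }


# Four questions: found at rank 1, rank 2, missed, found at rank 4.
batch = retrieval_metrics([1, 2, None, 4])
per_question = doc_metrics(
    ["computer_eng.json", "civil.json", "computer_eng.json", "it.json"],
    "computer_eng.json",
)
```

For a single question, Doc Precision and Doc Noise always sum to 1, since every retrieved chunk comes either from the expected document or from another one.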
### Classifier Evaluation (`test_classifier` / `POST /api/v1/rag/test_classifier`)
Evaluates predictions for all 4 classification fields (`type`, `category`, `topic`, `intent`):
| Metric | Notes |
|---|---|
| **Accuracy** | `sklearn.accuracy_score` |
| **Precision (macro)** | `zero_division=0` |
| **Recall (macro)** | `zero_division=0` |
| **F1 Macro** | Unweighted average across classes |
| **F1 Weighted** | Class-frequency weighted |
| **Classification Report** | Full per-class breakdown (`output_dict=True`) |
A bundled test dataset is stored in `app/utils/tests.py` as `classifier_test_dataset` and can be executed via `GET /api/v1/rag/test_classifier_dataset`. Results are **memoized** on the `RAGService.evaluation` dict for the lifetime of the server process.
---
## 7. Known Limitations
### Technical Debt
- **`preprocess_query` is incomplete.** The function body contains an LLM-powered query-rewriting block that is commented out. It currently just calls `normalize()` (tokenize only), so no stopword removal or lemmatization is applied to user queries (only to stored documents).
- **`search_docs` does not pass `filename` to Chroma as a metadata filter.** The filter is applied in Python after retrieval, which is inefficient for large collections.
- **Count intent is synthetic.** The `"Total <topic>: N"` chunk is generated during ingestion rather than taken from the source document. If the source data changes, stale count chunks can remain indexed.
- **`VectorStore.get_dict()`** still contains a `print(type(rows))` debug statement in production code.
- **`FileService.__init__` docstring** contains a stray backtick.
### Planned but Unimplemented
- **Query rewriting via local LLM:** skeleton is commented out in `preprocess_query()`.
- **Semantic caching:** no query result memoization at the API layer.
- **Re-ranker:** no cross-encoder re-ranking step; relies only on RRF + boosting.
- **`topic` field is not included in the ChromaDB hard filter:** only `type` + `intent` are hard-anchored; `category` and `topic` are soft `$or` hints.
### Performance Bottlenecks
- **Local LLM (LlamaCpp)** is CPU-only with `n_ctx=8096` and `n_threads=4`. Response latency is high (~10–30s) on low-RAM systems.
- **Classifier uses `SentenceTransformer` + `TF-IDF` features:** inference runs on every request with no caching of query embeddings.
- **BM25 corpus is rebuilt from scratch per request:** `BM25Retriever.from_documents()` is called inside `_bm25_rank()` each time.
- **`classifier_test_dataset`** lives in `app/utils/tests.py`, a very large module (1.8MB) that is loaded at import time.
- **The memoized evaluation** in `rag_service.evaluation` is not thread-safe, and because it lives in process memory it is not shared when the server runs with multiple workers.
---
## 8. File Structure
```
VGEC-RAG-Chatbot/
│
├── app/                              # Application package
│   ├── main.py                       # FastAPI app, router mounting, CORS middleware
│   ├── core/
│   │   ├── config.py                 # Pydantic Settings (all tuneable params)
│   │   └── paths.py                  # Path constants helper
│   │
│   ├── api/
│   │   ├── dependencies.py           # lru_cache singleton for RAGService
│   │   ├── routes/
│   │   │   ├── rag.py                # /rag endpoints (query, test, classifier)
│   │   │   ├── vector_store.py       # /vector endpoints (CRUD for ChromaDB)
│   │   │   └── settings.py           # /settings endpoints
│   │   └── schemas/
│   │       ├── requests.py           # RAGRequest, PaginationParams, etc.
│   │       └── tests.py              # TestRequestSchema, TestClassifierReqSchema
│   │
│   ├── services/
│   │   ├── rag_service.py            # RAGService (main orchestrator)
│   │   ├── hybrid_retrieval.py       # HybridRetrievalService + RRF logic
│   │   ├── classifier_service.py     # Classifier class + singleton clf
│   │   ├── ingestion_service.py      # IngestionService (chunking pipeline)
│   │   ├── file_service.py           # FileService (file I/O + metadata JSON)
│   │   ├── vector_store.py           # VectorStore (thin ChromaDB wrapper)
│   │   ├── text_splitter.py          # TextSplitter (RecursiveCharacter + variants)
│   │   └── document_loader.py        # (legacy loader, not in primary path)
│   │
│   ├── utils/
│   │   ├── preprocessing.py          # preprocess(), normalize(), preprocess_query()
│   │   ├── document_helpers.py       # get_references_v2(), build_metadata(), create_documents()
│   │   ├── model_factory.py          # get_llm_model(), get_embedding_model()
│   │   ├── constants.py              # stopwords list, short_words_mappings
│   │   ├── embeddings.py             # (thin embedding util)
│   │   ├── llm_models.py             # (thin LLM util)
│   │   └── tests.py                  # classifier_test_dataset (large, 1.8MB)
│   │
│   └── prompts/
│       └── __init__.py               # SYSTEM_PROMPT, wrap_exaone()
│
├── ml_models/
│   ├── classifier/
│   │   └── chatbot_classifier.pkl    # Pickled pipeline (models, tfidf, label encoders, etc.)
│   ├── embeddings/                   # (Local embedding model weights, if any)
│   └── llm/
│       └── EXAONE-3.5-2.4B-*.gguf    # Local LLM weights
│
├── data/
│   ├── department_data/              # Source JSON files per department
│   │   ├── computer_eng.json
│   │   ├── civil.json
│   │   └── ...
│   ├── documents/                    # Persistent copies of ingested files
│   ├── vector_stores/
│   │   └── classifier_test_1/        # ChromaDB persist directory
│   ├── classifier_test_1.json        # Ingestion metadata registry (FileService records)
│   └── other_data/                   # Misc data files
│
├── temp/                             # Staging area for uploaded files (auto-cleared)
├── scripts/                          # Offline scripts (training, testing)
├── tests/                            # Test files
│
├── requirements.txt                  # Pinned production dependencies
├── .env                              # Runtime secrets (google_api_key, etc.)
├── .env.example                      # Template for .env
└── CODEBASE_DOCUMENTATION.md         # This file
```
---
*End of documentation.*