	# VGEC RAG Chatbot — Codebase Documentation

	> Generated: 2026-03-25
	> Version: 1.0.0
	> Scope: Full system — ingestion, retrieval, classification, API, evaluation

	---

	## Table of Contents

	1. [Project Overview](#1-project-overview)
	2. [System Architecture](#2-system-architecture)
	3. [Schema & Data Model](#3-schema--data-model)
	4. [Retrieval Pipeline](#4-retrieval-pipeline)
	5. [Key Classes & Modules](#5-key-classes--modules)
	6. [Evaluation & Metrics](#6-evaluation--metrics)
	7. [Known Limitations](#7-known-limitations)
	8. [File Structure](#8-file-structure)

	---

	## 1. Project Overview

	### Purpose

	VGEC RAG Chatbot is a Retrieval-Augmented Generation (RAG) chatbot for Vishwakarma Government Engineering College (VGEC), Chandkheda, Gujarat. It allows students, faculty, and visitors to query structured information about the institution — departments, faculty, syllabus, labs, intake capacity, and more — through natural language.

	### Domain

	- Institution: VGEC (Government Engineering College, Gujarat)
	- Data Coverage: Department-level information for multiple disciplines (Computer Engineering, Civil, Electrical, IT, ECE, etc.)
	- Topics: Faculty lists, lab facilities, syllabus details, HOD info, research activities, intake capacity, achievements

	### Tech Stack

	\| Layer \| Technology \|
	\|---\|---\|
	\| API Framework \| FastAPI \|
	\| Vector Database \| ChromaDB (persistent, local) \|
	\| Embeddings \| Google `gemini-embedding-001` (via `langchain-google-genai`) \|
	\| LLM (Cloud) \| Google Gemini `gemini-2.5-flash-lite` \|
	\| LLM (Local) \| `EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf` via `llama-cpp-python` \|
	\| NLP / Preprocessing \| spaCy (`en_core_web_sm`), NLTK (PorterStemmer) \|
	\| Classifier \| Scikit-learn `LogisticRegression` + `SentenceTransformer` (`MongoDB/mdbr-leaf-mt`) \|
	\| BM25 \| `langchain-community` `BM25Retriever` \|
	\| Chunking \| LangChain `RecursiveCharacterTextSplitter` \|
	\| Config \| Pydantic `BaseSettings` (`.env`-backed) \|

	### Key Features Implemented

	- ✅ Structured JSON ingestion with intent-aware chunking
	- ✅ Hybrid retrieval: BM25 + vector search fused via Reciprocal Rank Fusion (RRF)
	- ✅ Intent/metadata classification with confidence-gated ChromaDB filters
	- ✅ Abbreviation expansion (`CE` → `Computer Engineering`, etc.)
	- ✅ Multi-turn conversation history support
	- ✅ Dual LLM backend with automatic fallback (Gemini ↔ Local)
	- ✅ Full CRUD REST API for vector store management
	- ✅ Offline evaluation endpoint (MRR, hit rate, noise rate)
	- ✅ Classifier accuracy evaluation endpoint

	---

	## 2. System Architecture

	### Component Diagram

	```
	              ┌──────────────────────────┐
	              │       FastAPI App        │
	              │   /api/v1/rag  /vector   │
	              └────────────┬─────────────┘
	                           │ DI (lru_cache)
	              ┌────────────▼─────────────┐
	              │        RAGService        │
	              │   (core orchestrator)    │
	              └───┬──────────────────┬───┘
	                  │                  │
	     ┌────────────▼────┐   ┌─────────▼──────────────┐
	     │ IngestionService│   │ HybridRetrievalService │
	     │  (write path)   │   │      (read path)       │
	     └──────┬──────────┘   └────┬──────────────┬────┘
	            │                   │              │
	     ┌──────▼───────┐   ┌───────▼──────┐ ┌─────▼─────────┐
	     │ FileService  │   │ ClassifierSvc│ │  VectorStore  │
	     │ (file + meta)│   │ (clf predict)│ │  (ChromaDB)   │
	     └──────────────┘   └──────────────┘ └───────────────┘
	```

	### Data Flow

	#### Ingestion Path

	```
	File Upload (PDF/MD/TXT/JSON)
	│
	▼
	FileService.read_file() ← type-aware loading (PyMuPDF for PDF)
	│ returns: Document + metadata
	▼
	FileService.write_file() ← persist copy to data/documents/
	│
	▼
	IngestionService.handle_*_docs() ← route by file extension
	│
	├─ JSON → handle_json_docs() ← intent-aware chunks (list / detail / count)
	└─ text → handle_text_docs() ← RecursiveCharacterTextSplitter + normalize()
	│
	▼
	VectorStore.add_documents() ← embed + upsert into ChromaDB
	│
	▼
	FileService.patch_metadata() ← update ingestion record JSON (chunk count, timing, size)
	```
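
The "route by file extension" step in the ingestion path can be pictured as a lookup from extension to handler. The sketch below is illustrative only: `route_by_extension`, the `HANDLERS` table, and the stub handlers are hypothetical stand-ins, not the actual `IngestionService` code.

```python
from pathlib import Path
from typing import Callable, Dict, List


def handle_json_docs(path: Path) -> List[str]:
    # Stand-in for the intent-aware JSON chunker.
    return [f"json-chunks:{path.name}"]


def handle_text_docs(path: Path) -> List[str]:
    # Stand-in for the RecursiveCharacterTextSplitter path.
    return [f"text-chunks:{path.name}"]


# Hypothetical dispatch table keyed by lowercase extension.
HANDLERS: Dict[str, Callable[[Path], List[str]]] = {
    ".json": handle_json_docs,
    ".pdf": handle_text_docs,
    ".md": handle_text_docs,
    ".txt": handle_text_docs,
}


def route_by_extension(file_path: str) -> List[str]:
    path = Path(file_path)
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return handler(path)


print(route_by_extension("data/department_data/computer_eng.json"))
```

Keeping the mapping in one table means a new file type only needs a new entry and handler.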

	#### Query Path

	```
	User Question
	│
	▼
	preprocess_query() ← currently normalize() only: tokenize + strip blanks
	│
	▼
	HybridRetrievalService.retrieve()
	│
	├─ clf.expand_abbreviations() ← CE → Computer Engineering
	├─ clf.predict_with_filter() ← LogReg predict → Chroma $and/$or filter
	├─ _vector_rank() ← ChromaDB similarity_search_with_score (k=15)
	├─ _bm25_rank() ← BM25 over the vector candidate pool
	├─ _reciprocal_rank_fusion() ← weighted RRF merge
	├─ metadata score boosting ← multiply fused scores for confident matches
	└─ _apply_title_boost() ← per-query-word title match bonus
	│
	▼
	get_references_v2() ← filter by threshold, build context string
	│
	▼
	LLM.invoke(prompt) ← Gemini or local LlamaCpp
	│
	▼
	Return: { answer, references, context, threshold_used, k_used }
	```

	### External Dependencies

	\| Dependency \| Role \| Provider \|
	\|---\|---\|---\|
	\| ChromaDB \| Persistent vector store \| Local disk \|
	\| Google Gemini API \| Embeddings + LLM generation \| Google Cloud \|
	\| LlamaCpp (GGUF model) \| Local LLM fallback \| Local CPU \|
	\| Sentence Transformers \| Classifier feature extraction \| HuggingFace Hub \|
	\| spaCy `en_core_web_sm` \| POS tagging / lemmatization \| Local \|

	---

	## 3. Schema & Data Model

	### Source JSON Format

	Source data files (e.g. `computer_eng.json`) follow this schema:

	```json
	{
	  "id": "computer-engineering-department",
	  "name": "Computer Engineering Department",
	  "source": "https://www.vgecg.ac.in/department.php?dept=3",
	  "category": "computer_eng",
	  "type": "department",
	  "created_date": "2026-02-19",
	  "content": {
	    "<topic_key>": {
	      "list": ["item 1", "item 2", "..."],
	      "details": "Paragraph describing the topic."
	    }
	  }
	}
	```

	Top-level fields:

	\| Field \| Type \| Description \|
	\|---\|---\|---\|
	\| `id` \| string \| Unique document identifier \|
	\| `name` \| string \| Human-readable institution/department name \|
	\| `source` \| string \| Authoritative URL \|
	\| `category` \| string \| Department slug (e.g. `computer_eng`) \|
	\| `type` \| string \| Document type (e.g. `department`) \|
	\| `created_date` \| string (ISO) \| Data creation date \|
	\| `content` \| object \| Topic map; each key = a topic \|

	### Chunk Metadata Schema (stored in ChromaDB)

	Every vector chunk stored in Chroma carries the following metadata:

	\| Field \| Type \| Source \|
	\|---\|---\|---\|
	\| `id` \| string (UUID) \| Auto-generated \|
	\| `title` \| string \| Document name / topic key \|
	\| `source` \| string \| Source URL \|
	\| `source_file` \| string \| Filename (e.g. `computer_eng.json`) \|
	\| `type` \| string \| Taxonomy level 1 (e.g. `department`) \|
	\| `category` \| string \| Taxonomy level 2 (e.g. `computer_eng`) \|
	\| `topic` \| string \| Taxonomy level 3 (e.g. `faculty`) \|
	\| `intent` \| string \| Chunk intent: `list`, `detail`, or `count` \|
	\| `chunk_index` \| int \| Sequential index within file \|
	\| `created_date` \| string (ISO) \| Ingestion timestamp \|
	\| `updated_at` \| string (ISO) \| Last modification timestamp \|
	\| `ext` \| string \| Source file extension (`json`, `pdf`, `md`, `txt`) \|

	### Hierarchical Taxonomy

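	The classifier predicts, and the ChromaDB filters operate on, a three-level hierarchy (`type` → `category` → `topic`). Each topic's chunks are additionally tagged with an `intent`: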

	```
	type
	└── category
	    └── topic
	        └── intent (list \| detail \| count)
	```

	Example mapping (Computer Engineering):

	```
	type: "department"
	└── category: "computer_eng"
	    ├── topic: "faculty" → intent: list \| detail
	    ├── topic: "lab" → intent: list \| detail
	    ├── topic: "syllabus" → intent: list \| detail
	    ├── topic: "hod" → intent: list \| detail
	    ├── topic: "intake" → intent: list \| detail
	    ├── topic: "research" → intent: list \| detail
	    └── topic: "achievements"
	```

	### Document Chunking Strategy

	JSON documents use a hand-crafted, intent-aware strategy in `IngestionService.handle_json_docs()`:

	\| Intent \| Chunk Content \| Metadata \|
	\|---\|---\|---\|
	\| `list` \| Numbered list: `1. item\n2. item\n...` \| `intent=list` \|
	\| `count` \| `"Total <topic>: N"` (auto-generated) \| `intent=count` \|
	\| `detail` \| Raw paragraph text \| `intent=detail` \|
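
To make the table concrete, here is a minimal sketch of how one `content[<topic_key>]` entry could be turned into the three chunk kinds. It is not the actual `handle_json_docs()` code: `chunk_topic` is a hypothetical helper, the output is plain dicts rather than LangChain `Document` objects, and it assumes the count chunk is emitted only when a `list` is present.

```python
from typing import Any, Dict, List


def chunk_topic(topic: str, entry: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one topic entry into list / count / detail chunks."""
    chunks: List[Dict[str, Any]] = []

    items = entry.get("list") or []
    if items:
        numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
        chunks.append({"text": numbered, "metadata": {"topic": topic, "intent": "list"}})
        # Synthetic chunk: derived from the list length, not present in the source data.
        chunks.append({
            "text": f"Total {topic}: {len(items)}",
            "metadata": {"topic": topic, "intent": "count"},
        })

    details = (entry.get("details") or "").strip()
    if details:
        chunks.append({"text": details, "metadata": {"topic": topic, "intent": "detail"}})

    return chunks


example = chunk_topic(
    "faculty",
    {"list": ["Prof. A", "Prof. B"], "details": "Two faculty members."},
)
for chunk in example:
    print(chunk["metadata"]["intent"], "->", repr(chunk["text"]))
```

Because the count chunk is derived rather than copied, it has to be regenerated whenever the underlying list changes.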

	Text/PDF/Markdown documents use `RecursiveCharacterTextSplitter`:
	- Default: `chunk_size=500`, `chunk_overlap=100`
	- Separator priority: `\n\n` → `\n` → ` ` → (character)
	- Markdown variant respects `---` section delimiters
	- Content is passed through `normalize()` (tokenize + strip blanks) before storage

	---

	## 4. Retrieval Pipeline

	### Query Processing Flow

	```python
	# Condensed walkthrough of the pipeline; names such as clf, chroma and result
	# refer to objects that exist inside the services, so this is not runnable as-is.

	# Step 1: Normalize input
	question = preprocess_query(question)
	# → currently normalize() only: tokenize + strip blanks
	#   (no POS filter, lemmatization, or stopword removal yet)

	# Step 2: Expand abbreviations
	processed_query = clf.expand_abbreviations(question)
	# → "CE dept" → "computer engineering department"

	# Step 3: Classify intent/metadata
	filters = clf.predict_with_filter([processed_query])
	# → {"$and": [{"type": "department"}, {"intent": "list"}, {"$or": [...]}]}

	# Step 4: Vector search with optional filter
	raw_results = chroma.similarity_search_with_score(processed_query, k=15, filter=filters)
	# Fallback: if filtered results empty, retry without filter

	# Step 5: BM25 re-rank over vector candidates
	bm25 = BM25Retriever.from_documents(candidate_docs)
	bm25_results = bm25.invoke(processed_query)

	# Step 6: RRF fusion (computed per document)
	fused_score = (bm25_weight / (rrf_k + rank_bm25)
	               + vector_weight / (rrf_k + rank_vec))

	# Step 7: Metadata confidence boosting
	if doc.metadata[field] == predicted_val and conf > 0.90:
	    result.fused_score *= boost_factor  # 1.10–1.20

	# Step 8: Title word boost
	for word in query_words:
	    if word in doc.metadata["title"]:
	        result.fused_score += title_boost_per_word  # 0.004

	# Step 9: Threshold filter + sort + top-k
	results = [r for r in results if r.fused_score >= threshold]
	results = sorted(results, key=lambda r: r.fused_score, reverse=True)[:top_k]
	```
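
Step 6 is the core of the hybrid ranking, so a small worked example may help. The function below is a self-contained sketch of weighted Reciprocal Rank Fusion over two ranked lists of document IDs. It uses the `hybrid_query` defaults documented in this section and assumes that a document missing from one list simply contributes nothing for that list. The real `_reciprocal_rank_fusion()` may handle that case differently.

```python
from typing import Dict, List


def reciprocal_rank_fusion(
    bm25_ranking: List[str],
    vector_ranking: List[str],
    bm25_weight: float = 0.45,
    vector_weight: float = 0.55,
    rrf_k: int = 20,
) -> Dict[str, float]:
    """Weighted RRF over two ranked lists of document IDs (rank 1 = best)."""
    fused: Dict[str, float] = {}
    for weight, ranking in ((bm25_weight, bm25_ranking), (vector_weight, vector_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (rrf_k + rank)
    return fused


scores = reciprocal_rank_fusion(["a", "b", "c"], ["b", "a", "d"])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['b', 'a', 'd', 'c']
```

With these defaults, a document ranked first in both lists scores `0.45/21 + 0.55/21 = 1/21 ≈ 0.048`, which is the largest fused score possible before any boosting.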

	### Classifier Thresholds

	The `Classifier` uses two separate threshold tables:

	Prediction threshold: below this confidence, the field is set to `None` and ignored downstream:

	\| Field \| Threshold \|
	\|---\|---\|
	\| `type` \| 0.40 \|
	\| `category` \| 0.40 \|
	\| `topic` \| 0.50 \|
	\| `intent` \| 0.60 \|

	Filter threshold: at or above this confidence, the field is used when building the ChromaDB filter (`type` as a hard `$and` anchor, `category` and `topic` as soft `$or` hints):

	\| Field \| Threshold \|
	\|---\|---\|
	\| `type` \| 0.65 \|
	\| `category` \| 0.65 \|
	\| `topic` \| 0.70 \|
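
The two tables act as successive gates. The sketch below applies both to a single raw prediction. `gate_prediction` and its return shape are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption.

```python
from typing import Dict, Optional

PREDICTION_THRESHOLDS = {"type": 0.40, "category": 0.40, "topic": 0.50, "intent": 0.60}
FILTER_THRESHOLDS = {"type": 0.65, "category": 0.65, "topic": 0.70}


def gate_prediction(labels: Dict[str, str], confs: Dict[str, float]) -> Dict[str, Dict]:
    """Apply both threshold tables to one raw prediction."""
    gated = {}
    for field, threshold in PREDICTION_THRESHOLDS.items():
        conf = confs[field]
        # Gate 1: drop the label entirely when confidence is too low.
        value: Optional[str] = labels[field] if conf >= threshold else None
        # Gate 2: only confident, filterable fields may shape the Chroma filter.
        usable_in_filter = (
            value is not None
            and field in FILTER_THRESHOLDS
            and conf >= FILTER_THRESHOLDS[field]
        )
        gated[field] = {"value": value, "conf": conf, "filter": usable_in_filter}
    return gated


result = gate_prediction(
    {"type": "department", "category": "computer_eng", "topic": "faculty", "intent": "list"},
    {"type": 0.92, "category": 0.55, "topic": 0.45, "intent": 0.81},
)
for field, info in result.items():
    print(field, info)
```

In this example `category` survives as a prediction but is too uncertain to filter on, while `topic` is discarded outright.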

	### Filter Construction Logic (`_build_filter`)

	```python
	# Gate: if type confidence < 0.65 → return None (full scan)
	# Hard anchors (always included if type passes):
	# - type == predicted_type
	# - intent == predicted_intent (special: "count" expands to count OR detail)
	# Soft hints (combined as $or):
	# - category == predicted_category (if conf >= 0.65, else "general")
	# - topic == predicted_topic (if conf >= 0.70, else "general")
	```
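
The rules above translate fairly directly into code. The following is a sketch written from those comments, not a copy of `_build_filter`. The input shape (`type_conf`, `category_conf`, ...) follows the `*_conf` keys returned by `predict()`, and skipping the intent clause when intent is `None` is an assumption.

```python
from typing import Any, Dict, Optional


def build_filter(pred: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build a Chroma `where` filter from one prediction dict."""
    # Gate: without a confident type, search the whole collection.
    if pred["type"] is None or pred["type_conf"] < 0.65:
        return None

    clauses = [{"type": pred["type"]}]

    # Hard anchor on intent; "count" also admits "detail" chunks.
    intent = pred.get("intent")
    if intent == "count":
        clauses.append({"$or": [{"intent": "count"}, {"intent": "detail"}]})
    elif intent is not None:
        clauses.append({"intent": intent})

    # Soft hints: either a matching category or a matching topic is enough.
    category = pred["category"] if pred["category_conf"] >= 0.65 else "general"
    topic = pred["topic"] if pred["topic_conf"] >= 0.70 else "general"
    clauses.append({"$or": [{"category": category}, {"topic": topic}]})

    return {"$and": clauses}


confident = {
    "type": "department", "type_conf": 0.93,
    "category": "computer_eng", "category_conf": 0.88,
    "topic": "faculty", "topic_conf": 0.61,
    "intent": "count", "intent_conf": 0.75,
}
print(build_filter(confident))
```

Here the low-confidence topic falls back to `"general"`, so the `$or` clause matches chunks from the predicted category or chunks tagged with the general topic.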

	### Hybrid Retrieval Config (Defaults)

	\| Parameter \| `hybrid_query` \| `search_docs` \|
	\|---\|---\|---\|
	\| `candidate_k` \| 15 \| 15 \|
	\| `top_k` (final) \| `settings.similarity_top_k` (8) \| k (param) \|
	\| `bm25_weight` \| 0.45 \| 0.70 \|
	\| `vector_weight` \| 0.55 \| 0.30 \|
	\| `rrf_k` \| 20 \| 20 \|
	\| `bm25_k1` \| 1.2 \| 1.5 \|
	\| `bm25_b` \| 0.9 \| 0.75 \|
	\| `title_boost_per_word` \| 0.004 \| 0.004 \|
	\| `score_threshold` \| 0.4 \| 0.4 \|

	> Note: `search_docs` is BM25-heavy (0.70) because it serves keyword-oriented document browsing, while `hybrid_query` leans slightly toward the vector ranking (0.55 vs 0.45) for semantic QA.
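
The effect of the two weight profiles can be seen on a pair of documents that the rankers disagree about. The `FusionProfile` class below is a hypothetical container for the documented weights; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionProfile:
    bm25_weight: float
    vector_weight: float
    rrf_k: int = 20

    def score(self, bm25_rank: int, vector_rank: int) -> float:
        return (self.bm25_weight / (self.rrf_k + bm25_rank)
                + self.vector_weight / (self.rrf_k + vector_rank))


HYBRID_QUERY = FusionProfile(bm25_weight=0.45, vector_weight=0.55)
SEARCH_DOCS = FusionProfile(bm25_weight=0.70, vector_weight=0.30)

# Document A: strong keyword match (BM25 rank 1), weaker semantically (vector rank 10).
# Document B: the reverse.
for name, profile in (("hybrid_query", HYBRID_QUERY), ("search_docs", SEARCH_DOCS)):
    a = profile.score(bm25_rank=1, vector_rank=10)
    b = profile.score(bm25_rank=10, vector_rank=1)
    print(f"{name}: A={a:.4f} B={b:.4f} winner={'A' if a > b else 'B'}")
```

Under the `hybrid_query` weights the semantically closer document B wins. Under the `search_docs` weights the keyword match A wins.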

	---

	## 5. Key Classes & Modules

	### Services (`app/services/`)

	#### `RAGService`

	Main orchestrator. Singleton via `lru_cache` in `dependencies.py`.

	\| Method \| Description \|
	\|---\|---\|
	\| `query()` \| Semantic-only QA (vector search → LLM) \|
	\| `hybrid_query()` \| Hybrid QA (BM25 + vector → RRF → LLM) \|
	\| `search_docs()` \| BM25-heavy document search, no LLM \|
	\| `ingest_documents()` \| Ingest a file path into the vector store \|
	\| `get_filenames()` \| Return all tracked file metadata records \|
	\| `test_queries()` \| Batch retrieval evaluation (MRR, precision, noise) \|
	\| `test_classifier()` \| Batch classifier accuracy evaluation \|
	\| `delete_database()` \| Drop the entire ChromaDB collection \|
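
The `lru_cache` singleton pattern mentioned above works because a zero-argument cached function returns the same object on every call. The snippet below demonstrates the idea with a stand-in class. The provider name `get_rag_service` is an assumption, and the real `dependencies.py` may differ.

```python
from functools import lru_cache


class RAGService:
    """Stand-in for the real service; construction is assumed to be expensive."""

    instances_created = 0

    def __init__(self) -> None:
        RAGService.instances_created += 1


@lru_cache(maxsize=1)
def get_rag_service() -> RAGService:
    # First call builds the service; later calls return the cached instance.
    return RAGService()


first = get_rag_service()
second = get_rag_service()
print(first is second, RAGService.instances_created)  # True 1
```

In a FastAPI route such a provider would typically be injected with `Depends(get_rag_service)`, so every request in the same process shares one service instance.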

	#### `HybridRetrievalService`

	Stateless per-request service created inline by `RAGService`.

	\| Method \| Description \|
	\|---\|---\|
	\| `retrieve(query)` \| Full hybrid retrieval pipeline; returns `List[RetrievalResult]` \|
	\| `_vector_rank()` \| Chroma similarity search + classifier filter \|
	\| `_bm25_rank()` \| BM25 over candidate pool \|
	\| `_reciprocal_rank_fusion()` \| Merge both ranked lists via RRF \|
	\| `_apply_title_boost()` \| Word-level title match score bonus \|

	`RetrievalResult` dataclass:

	```python
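	from dataclasses import dataclass
	from typing import Optional

	from langchain_core.documents import Document  # assumed import path


	@dataclass
	class RetrievalResult:
	    document: Document
	    fused_score: float
	    bm25_rank: Optional[int]
	    vector_rank: Optional[int]
	    title_boost: float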
	```
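
To show how `fused_score` and `title_boost` might be updated after fusion, here is a simplified sketch of the two boosting steps. `ScoredChunk` and `apply_boosts` are hypothetical. The sketch matches whole lowercase words against the title, which may be stricter than the real `_apply_title_boost()`, and it uses a single `boost_factor` of 1.10 from the documented 1.10 to 1.20 range.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ScoredChunk:
    """Simplified stand-in for RetrievalResult (metadata dict instead of a Document)."""
    metadata: Dict[str, str]
    fused_score: float
    title_boost: float = 0.0


def apply_boosts(
    chunk: ScoredChunk,
    query: str,
    predicted: Dict[str, str],
    confs: Dict[str, float],
    boost_factor: float = 1.10,
    title_boost_per_word: float = 0.004,
) -> ScoredChunk:
    # Metadata boost: multiply once per field that matches a confident prediction.
    for field, value in predicted.items():
        if confs.get(field, 0.0) > 0.90 and chunk.metadata.get(field) == value:
            chunk.fused_score *= boost_factor
    # Title boost: fixed bonus per distinct query word found in the title.
    title_words = set(chunk.metadata.get("title", "").lower().split())
    matches = sum(1 for word in set(query.lower().split()) if word in title_words)
    chunk.title_boost = matches * title_boost_per_word
    chunk.fused_score += chunk.title_boost
    return chunk


chunk = ScoredChunk(
    metadata={"title": "Computer Engineering Faculty", "topic": "faculty"},
    fused_score=0.04,
)
apply_boosts(chunk, "computer engineering faculty list", {"topic": "faculty"}, {"topic": 0.95})
print(round(chunk.fused_score, 4), round(chunk.title_boost, 4))
```

In this example the multiplicative boost lifts 0.040 to 0.044, and three matching title words add another 0.012.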

	#### `Classifier`

	Loaded at startup from a pickled pipeline (`chatbot_classifier.pkl`).

	\| Method \| Description \|
	\|---\|---\|
	\| `predict(queries)` \| Returns list of `{type, category, topic, intent, *_conf}` dicts \|
	\| `predict_with_filter(queries)` \| Returns a ChromaDB-compatible filter dict or `None` \|
	\| `expand_abbreviations(text)` \| Regex-based abbreviation expansion \|
	\| `get_features(queries)` \| Build `[SentenceTransformer embedding \| TF-IDF]` feature matrix \|
	\| `train_models(df)` \| Train 4 LogisticRegression classifiers (offline use) \|
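
A regex-based expansion step like `expand_abbreviations` can be sketched in a few lines. The mapping below is illustrative: the `ce` and `dept` entries follow the examples in this document, `hod` is a hypothetical addition, and the project's own mapping (`short_words_mappings` in `constants.py`) is not reproduced here.

```python
import re
from typing import Dict

# Illustrative mapping; not the project's actual table.
ABBREVIATIONS: Dict[str, str] = {
    "ce": "computer engineering",
    "dept": "department",
    "hod": "head of department",
}

# One alternation with word boundaries, so "ce" inside "science" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ABBREVIATIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations, leaving other text untouched."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)


print(expand_abbreviations("CE dept"))      # computer engineering department
print(expand_abbreviations("science lab"))  # science lab
```

Expanding before classification gives both the classifier and BM25 the full department name to match against.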

	#### `IngestionService`

	\| Method \| Description \|
	\|---\|---\|
	\| `ingest(file_path)` \| Load + chunk a file; returns `List[Document]` \|
	\| `handle_json_docs()` \| Intent-aware chunking for structured JSON data \|
	\| `handle_text_docs()` \| Recursive character splitting for unstructured text \|
	\| `get_records()` \| Delegate to `FileService.get_records()` \|
	\| `delete_record(filename)` \| Remove a file's metadata record \|
	\| `path_record(path, metadata)` \| Patch ingestion stats after indexing \|

	#### `FileService`

	\| Method \| Description \|
	\|---\|---\|
	\| `read_file(path)` \| Load file content; dispatches by extension \|
	\| `write_file(path, content, metadata)` \| Persist file to `data/documents/` \|
	\| `patch_metadata(path, metadata)` \| Merge new fields into existing record \|
	\| `get_records()` \| Return all ingestion records dict \|
	\| `delete_record(filename)` \| Remove a record from `<collection>.json` \|

	#### `VectorStore`

	Thin wrapper around `langchain_chroma.Chroma`.

	\| Method \| Description \|
	\|---\|---\|
	\| `get()` \| Retrieve all documents \|
	\| `get_by_id(ids)` \| Retrieve specific documents by ID \|
	\| `add_documents(docs)` \| Embed + insert, skipping empty chunks \|
	\| `update_document(id, doc)` \| Delete then re-insert with same ID \|
	\| `delete(ids)` \| Remove documents by ID list \|
	\| `similarity_search_with_score()` \| Wrapped Chroma search \|

	### Utilities (`app/utils/`)

	#### `preprocessing.py`

	\| Function \| Description \|
	\|---\|---\|
	\| `preprocess(text)` \| spaCy POS filter + lemmatize + stopword removal → joined string \|
	\| `normalize(text)` \| Tokenize + strip blanks (lightweight, no POS) \|
	\| `preprocess_query(query)` \| Applies `normalize()` to user queries \|
	\| `preprocess_documents(docs)` \| Applies `preprocess()` to a document list in-place \|
	\| `preprocess_filename(path)` \| Sanitize filename (remove special chars, lowercase) \|

	#### `document_helpers.py`

	\| Function \| Description \|
	\|---\|---\|
	\| `get_references_v2(docs, threshold)` \| Convert `RetrievalResult` list → references dict + context string \|
	\| `get_references(docs, threshold)` \| Same for raw `(Document, distance)` tuples (used by `query()`) \|
	\| `build_metadata(path)` \| Parse YAML frontmatter from `.md`/`.txt` files \|
	\| `create_documents(chunks, ...)` \| Attach standard metadata (UUID, timestamps, indices) to chunks \|
	\| `create_documents_from_text(text)` \| Full pipeline: frontmatter parse → split → metadata attach \|
	\| `clean_metadata(metadata)` \| Serialize datetime, coerce non-allowed types to string \|
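
Chroma stores only scalar metadata values (strings, numbers, booleans), which is why a cleaning step such as `clean_metadata` is needed. The sketch below shows one way to do it. Dropping `None` values is an assumption, and the real function may treat them differently.

```python
from datetime import date, datetime
from typing import Any, Dict

ALLOWED_TYPES = (str, int, float, bool)


def clean_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    cleaned: Dict[str, Any] = {}
    for key, value in metadata.items():
        if value is None:
            continue  # assumption: drop missing values rather than store "None"
        if isinstance(value, (datetime, date)):
            cleaned[key] = value.isoformat()  # ISO string, as in the chunk schema
        elif isinstance(value, ALLOWED_TYPES):
            cleaned[key] = value
        else:
            cleaned[key] = str(value)  # lists, dicts, etc. become strings
    return cleaned


raw = {
    "created_date": datetime(2026, 2, 19, 10, 30),
    "chunk_index": 3,
    "tags": ["a", "b"],
    "note": None,
}
print(clean_metadata(raw))
```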

	#### `model_factory.py`

	\| Function \| Description \|
	\|---\|---\|
	\| `get_embedding_model()` \| Returns `GoogleGenerativeAIEmbeddings` \|
	\| `get_gemini_model()` \| Returns `ChatGoogleGenerativeAI` \|
	\| `get_local_model()` \| Returns `ChatLlamaCpp` (GGUF, CPU inference) \|
	\| `get_llm_model(provider)` \| Dispatches to Gemini or Local with fallback logic \|
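
The fallback behaviour of `get_llm_model` can be illustrated with plain callables. This is a sketch of the general pattern only: `get_llm_with_fallback` and the injected `loaders` dict are hypothetical, and the real function takes just a `provider` argument and reads `enable_fallback` from settings.

```python
from typing import Callable, Dict


def get_llm_with_fallback(
    provider: str,
    loaders: Dict[str, Callable[[], object]],
    enable_fallback: bool = True,
) -> object:
    """Try the preferred provider first, then the remaining ones if allowed."""
    order = [provider] + [name for name in loaders if name != provider]
    if not enable_fallback:
        order = order[:1]
    errors = []
    for name in order:
        try:
            return loaders[name]()
        except Exception as exc:  # e.g. missing API key or model file
            errors.append(f"{name}: {exc}")
    raise RuntimeError("No LLM backend available (" + "; ".join(errors) + ")")


def load_gemini() -> object:
    raise ConnectionError("API unreachable")  # simulate a cloud failure


def load_local() -> object:
    return "local-llm"  # placeholder for a ChatLlamaCpp instance


loaders = {"gemini": load_gemini, "local": load_local}
model = get_llm_with_fallback("gemini", loaders)
print(model)  # local-llm
```

Because the order is derived from the preferred provider, the same function covers both directions of the Gemini ↔ Local fallback.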

	### API Routes (`app/api/routes/`)

	#### `rag.py` — prefix `/api/v1/rag`

	\| Method \| Endpoint \| Description \|
	\|---\|---\|---\|
	\| GET \| `/` \| Health check \|
	\| POST \| `/` \| Semantic query \|
	\| POST \| `/hybrid_query` \| Hybrid RAG query (primary endpoint) \|
	\| POST \| `/similarity_search` \| Hybrid retrieval, no LLM response \|
	\| POST \| `/search` \| BM25-heavy document search \|
	\| POST \| `/test` \| Batch retrieval evaluation \|
	\| POST \| `/test_classifier` \| Classifier accuracy evaluation \|
	\| GET \| `/test_classifier_dataset` \| Run built-in test dataset, cache result \|

	#### `vector_store.py` — prefix `/api/v1/vector`

	\| Method \| Endpoint \| Description \|
	\|---\|---\|---\|
	\| GET \| `/` \| List all documents (paginated, filterable) \|
	\| GET \| `/filenames` \| List ingested file records \|
	\| GET \| `/{id}` \| Get single document by ChromaDB ID \|
	\| POST \| `/` \| Upload + ingest file \|
	\| PUT \| `/{id}` \| Update document content/metadata \|
	\| DELETE \| `/ids` \| Bulk delete by ID list \|
	\| DELETE \| `/{id}` \| Delete single document \|
	\| DELETE \| `/` \| Filter-based delete (filename/source/contains) \|

	### Configuration (`app/core/config.py`)

	All settings are read from `.env` via Pydantic `BaseSettings`:

	```python
	class Settings(BaseSettings):
	    # Paths
	    collection_name: str = "classifier_test_1"
	    persist_directory: str = "./data/vector_stores/classifier_test_1"

	    # Chunking
	    chunk_size: int = 500
	    chunk_overlap: int = 100

	    # Retrieval
	    similarity_top_k: int = 8
	    similarity_threshold: float = 0.4

	    # LLM Provider
	    llm_provider: Literal["gemini", "local"] = "local"
	    enable_fallback: bool = True

	    # Models
	    embedding_model_name: str = "models/gemini-embedding-001"
	    gemini_model_name: str = "gemini-2.5-flash-lite"
	    local_model_name: str = "EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf"

	    # Generation
	    max_output_tokens: int = 2048
	    local_max_tokens: int = 512

	    # Auth
	    google_api_key: str  # required; must be set in .env
	```

	---

	## 6. Evaluation & Metrics

	### Retrieval Evaluation (`test_queries` / `POST /api/v1/rag/test`)

	Tests each (question, expected_document, expected_chunk_index) triple against `hybrid_query`:

	\| Metric \| Formula \| Interpretation \|
	\|---\|---\|---\|
	\| Hit Rate \| `hits / total` \| % of questions where the exact chunk was retrieved \|
	\| Top-1 Hit Rate \| `rank==1 hits / total` \| % of questions where exact chunk was top result \|
	\| MRR \| `mean(1/rank)` \| Mean Reciprocal Rank; higher = correct result ranked earlier \|
	\| Doc Precision \| `correct_source_chunks / all_chunks` \| How many retrieved chunks came from the right document \|
	\| Doc Recall \| `1 if any correct_source_chunk else 0` \| Did we retrieve at least one chunk from the right document? \|
	\| Doc Noise \| `wrong_source_chunks / all_chunks` \| Proportion of off-topic chunks in the result set \|
	\| Error Rate \| `1 - hit_rate` \| Miss rate for exact chunk retrieval \|
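
The formulas in the table are simple enough to check by hand. The sketch below computes them from a list of ranks. It assumes a non-empty test set and that a miss contributes 0 to MRR. The function names are illustrative and not taken from `test_queries()`.

```python
from typing import Dict, List, Optional, Tuple


def retrieval_metrics(ranks: List[Optional[int]]) -> Dict[str, float]:
    """ranks[i] is the 1-based rank of the expected chunk for question i, or None on a miss."""
    total = len(ranks)
    hits = [r for r in ranks if r is not None]
    hit_rate = len(hits) / total
    return {
        "hit_rate": hit_rate,
        "top1_hit_rate": sum(1 for r in hits if r == 1) / total,
        "mrr": sum(1 / r for r in hits) / total,
        "error_rate": 1 - hit_rate,
    }


def doc_precision_and_noise(retrieved_sources: List[str], expected_source: str) -> Tuple[float, float]:
    """Share of retrieved chunks from the expected file, and the remainder (noise)."""
    correct = sum(1 for s in retrieved_sources if s == expected_source)
    precision = correct / len(retrieved_sources)
    return precision, 1 - precision


metrics = retrieval_metrics([1, 3, None, 2])
print(metrics)
print(doc_precision_and_noise(
    ["computer_eng.json", "computer_eng.json", "civil.json", "computer_eng.json"],
    "computer_eng.json",
))  # (0.75, 0.25)
```

For the four example questions, three are hits (hit rate 0.75), one is a top-1 hit (0.25), and MRR is `(1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458`.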

	Test Input Schema:

	```python
	class TestRequestSchema(BaseModel):
	    tests: List[Test]  # question + document + chunk_index
	    k: int = 5
	    threshold: float = 0.4
	```

	### Classifier Evaluation (`test_classifier` / `POST /api/v1/rag/test_classifier`)

	Evaluates predictions for all 4 classification fields (`type`, `category`, `topic`, `intent`):

	\| Metric \| Notes \|
	\|---\|---\|
	\| Accuracy \| `sklearn.accuracy_score` \|
	\| Precision (macro) \| `zero_division=0` \|
	\| Recall (macro) \| `zero_division=0` \|
	\| F1 Macro \| Unweighted average across classes \|
	\| F1 Weighted \| Class-frequency weighted \|
	\| Classification Report \| Full per-class breakdown (`output_dict=True`) \|
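
The difference between the macro and weighted F1 rows matters when classes are imbalanced. The following pure-Python sketch computes both on a toy `intent` sample. The endpoint itself uses scikit-learn, and this hand-rolled version exists only to make the averaging explicit.

```python
from collections import Counter
from typing import Dict, List


def f1_per_class(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0  # same idea as zero_division=0
    return scores


y_true = ["list", "list", "list", "detail", "count"]
y_pred = ["list", "list", "detail", "detail", "list"]

per_class = f1_per_class(y_true, y_pred)
support = Counter(y_true)

f1_macro = sum(per_class.values()) / len(per_class)  # every class counts equally
f1_weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)

print(per_class)
print(round(f1_macro, 3), round(f1_weighted, 3))  # 0.444 0.533
```

The rare `count` class is never predicted correctly, which drags the macro score down more than the weighted one.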

	A bundled test dataset is stored in `app/utils/tests.py` as `classifier_test_dataset` and can be executed via `GET /api/v1/rag/test_classifier_dataset`. Results are memoized on the `RAGService.evaluation` dict for the lifetime of the server process.

	---

	## 7. Known Limitations

	### Technical Debt

	- `preprocess_query` is incomplete. The function body contains an LLM-powered query-rewriting block that is commented out, so the function currently just calls `normalize()` (tokenize only). No stopword removal or lemmatization is applied to user queries (only to stored documents).
	- `search_docs` does not pass `filename` to Chroma as a metadata filter. The filter is applied in Python after retrieval, which is inefficient for large collections.
	- Count intent is synthetic. The `"Total <topic>: N"` chunk is generated during ingestion rather than taken from the source document. If the source data changes, stale count chunks can remain indexed.
	- `VectorStore.get_dict()` still contains a `print(type(rows))` debug statement.
	- The `FileService.__init__` docstring contains a stray backtick.

	### Planned but Unimplemented

	- Query rewriting via local LLM — skeleton is commented out in `preprocess_query()`.
	- Semantic caching — no query result memoization at the API layer.
	- Re-ranker — no cross-encoder re-ranking step; relies only on RRF + boosting.
	- `topic` field is not included in the ChromaDB hard filter — only `type` + `intent` are hard-anchored; `category` and `topic` are soft `$or` hints.

	### Performance Bottlenecks

	- Local LLM (LlamaCpp) is CPU-only with `n_ctx=8096` and `n_threads=4`. Response latency is high (~10–30s) on low-RAM systems.
	- Classifier uses `SentenceTransformer` + `TF-IDF` features — inference runs on every request with no caching of query embeddings.
	- BM25 corpus is rebuilt from scratch per request — `BM25Retriever.from_documents()` is called inside `_bm25_rank()` each time.
	- `classifier_test_dataset` lives in `app/utils/tests.py`, a very large file (1.8MB) that is loaded at import time.
	- The memoized evaluation in `RAGService.evaluation` is a plain in-process dict: access is not synchronized across threads, and the cache is not shared when the server runs with multiple workers.

	---

	## 8. File Structure

	```
	VGEC-RAG-Chatbot/
	│
	├── app/  # Application package
	│   ├── main.py  # FastAPI app, router mounting, CORS middleware
	│   ├── core/
	│   │   ├── config.py  # Pydantic Settings (all tuneable params)
	│   │   └── paths.py  # Path constants helper
	│   │
	│   ├── api/
	│   │   ├── dependencies.py  # lru_cache singleton for RAGService
	│   │   ├── routes/
	│   │   │   ├── rag.py  # /rag endpoints (query, test, classifier)
	│   │   │   ├── vector_store.py  # /vector endpoints (CRUD for ChromaDB)
	│   │   │   └── settings.py  # /settings endpoints
	│   │   └── schemas/
	│   │       ├── requests.py  # RAGRequest, PaginationParams, etc.
	│   │       └── tests.py  # TestRequestSchema, TestClassifierReqSchema
	│   │
	│   ├── services/
	│   │   ├── rag_service.py  # RAGService (main orchestrator)
	│   │   ├── hybrid_retrieval.py  # HybridRetrievalService + RRF logic
	│   │   ├── classifier_service.py  # Classifier class + singleton clf
	│   │   ├── ingestion_service.py  # IngestionService (chunking pipeline)
	│   │   ├── file_service.py  # FileService (file I/O + metadata JSON)
	│   │   ├── vector_store.py  # VectorStore (thin ChromaDB wrapper)
	│   │   ├── text_splitter.py  # TextSplitter (RecursiveCharacter + variants)
	│   │   └── document_loader.py  # (legacy loader, not in primary path)
	│   │
	│   ├── utils/
	│   │   ├── preprocessing.py  # preprocess(), normalize(), preprocess_query()
	│   │   ├── document_helpers.py  # get_references_v2(), build_metadata(), create_documents()
	│   │   ├── model_factory.py  # get_llm_model(), get_embedding_model()
	│   │   ├── constants.py  # stopwords list, short_words_mappings
	│   │   ├── embeddings.py  # (thin embedding util)
	│   │   ├── llm_models.py  # (thin LLM util)
	│   │   └── tests.py  # classifier_test_dataset (large, 1.8MB)
	│   │
	│   └── prompts/
	│       └── __init__.py  # SYSTEM_PROMPT, wrap_exaone()
	│
	├── ml_models/
	│   ├── classifier/
	│   │   └── chatbot_classifier.pkl  # Pickled pipeline (models, tfidf, label encoders, etc.)
	│   ├── embeddings/  # (Local embedding model weights, if any)
	│   └── llm/
	│       └── EXAONE-3.5-2.4B-*.gguf  # Local LLM weights
	│
	├── data/
	│   ├── department_data/  # Source JSON files per department
	│   │   ├── computer_eng.json
	│   │   ├── civil.json
	│   │   └── ...
	│   ├── documents/  # Persistent copies of ingested files
	│   ├── vector_stores/
	│   │   └── classifier_test_1/  # ChromaDB persist directory
	│   ├── classifier_test_1.json  # Ingestion metadata registry (FileService records)
	│   └── other_data/  # Misc data files
	│
	├── temp/  # Staging area for uploaded files (auto-cleared)
	├── scripts/  # Offline scripts (training, testing)
	├── tests/  # Test files
	│
	├── requirements.txt  # Pinned production dependencies
	├── .env  # Runtime secrets (google_api_key, etc.)
	├── .env.example  # Template for .env
	└── CODEBASE_DOCUMENTATION.md  # This file
	```

	---

	End of documentation.